lucene-java-user mailing list archives

From Weiwei Wang <ww.wang...@gmail.com>
Subject Re: Recover special terms from StandardTokenizer
Date Sun, 13 Dec 2009 03:12:44 GMT
I used Luke to check the result and found that only c exists as a term;
there is no cplusplus in the index.
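
To narrow this down, it may help to look at each stage separately. The block below is only a minimal, library-free sketch of the idea in the quoted analyzer (lowercase, then map c++ to cplusplus$, then tokenize). It does not use Lucene: CharMappingSketch, applyMappings and tokenize are hypothetical names, and tokenize is an assumed simplification that keeps runs of letters and digits and treats every other character as a separator, not StandardTokenizer's real grammar.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class CharMappingSketch {

    // Hypothetical stand-in for a char-mapping step: lowercases the raw text,
    // then replaces every occurrence of each key with its value.
    public static String applyMappings(String text, Map<String, String> mappings) {
        String out = text.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : mappings.entrySet()) {
            out = out.replace(entry.getKey(), entry.getValue());
        }
        return out;
    }

    // Assumed simplification of a tokenizer: emits runs of letters/digits
    // and drops everything else (so '+' and '$' act as separators).
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isLetterOrDigit(ch)) {
                current.append(ch);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, String> recoveryMap = new LinkedHashMap<>();
        recoveryMap.put("c++", "cplusplus$");

        // Without the mapping step, the '+' characters are lost.
        System.out.println(tokenize("c++c++"));      // prints [c, c]

        // With the mapping applied before tokenization, the term survives.
        List<String> indexed = tokenize(applyMappings("c++c++", recoveryMap));
        System.out.println(indexed);                 // prints [cplusplus, cplusplus]

        // The query goes through the same chain, so it yields the same term.
        List<String> query = tokenize(applyMappings("C++", recoveryMap));
        System.out.println(indexed.containsAll(query)); // prints true
    }
}
```

In this sketch the mapped term reaches the tokenizer intact, so if the real index still contains only c, the mapping is probably not taking effect before tokenization. Printing the output of each real stage in the same way should show where the text is lost.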

On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang <ww.wang.cs@gmail.com> wrote:

> Thanks, Koji. I followed your advice and changed my analyzer as shown below:
> NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> RECOVERY_MAP.add("c++","cplusplus$");
> CharFilter filter = new LowercaseCharFilter(reader);
> filter = new MappingCharFilter(RECOVERY_MAP,filter);
> StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30,
> filter);
> tokenStream.setMaxTokenLength(maxTokenLength);
> TokenStream result = new StandardFilter(tokenStream);
> result = new LowerCaseFilter(result);
> result = new StopFilter(enableStopPositionIncrements, result, stopSet);
> result = new SnowballFilter(result, STEMMER);
>
> I use the same analyzer on the search side. Since this analyzer tokenizes
> c++ as cplusplus, I should be able to search for c++ with the same
> analyzer, because the query is also tokenized as cplusplus.
>
> I tested it on the string c++c++. However, when I search for c++ in the
> built index, nothing is returned.
>
> I do not know what's wrong with my code. Waiting for your reply.
>
>
>
>
>
> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang <ww.wang.cs@gmail.com> wrote:
>
>> Thanks, Koji
>>
>>
>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi <koji@r.email.ne.jp> wrote:
>>
>>> MappingCharFilter can be used to convert c++ to cplusplus.
>>>
>>> Koji
>>>
>>> --
>>> http://www.rondhuit.com/en/
>>>
>>>
>>>
>>> Anshum wrote:
>>>
>>>> How about getting the original token stream and then converting c++ to
>>>> cplusplus, or applying any other such transform? Or perhaps you might
>>>> look at using/extending (in the non-Java sense) some other tokenizer!
>>>>
>>>> --
>>>> Anshum Gupta
>>>> Naukri Labs!
>>>> http://ai-cafe.blogspot.com
>>>>
>>>> The facts expressed here belong to everybody, the opinions to me. The
>>>> distinction is yours to draw............
>>>>
>>>>
>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang <ww.wang.cs@gmail.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>>> Hi, all,
>>>>>    I designed an FTP search engine based on Lucene. I made a few
>>>>> modifications to the StandardTokenizer.
>>>>> My problem is:
>>>>>  C++ is tokenized as c by StandardTokenizer, and I want to recover the
>>>>> original term from the TokenStream that StandardTokenizer produces.
>>>>>
>>>>> What should I do?
>>>>>
>>>>> --
>>>>> Weiwei Wang
>>>>> Alex Wang
>>>>> 王巍巍
>>>>> Room 403, Mengmin Wei Building
>>>>> Computer Science Department
>>>>> Gulou Campus of Nanjing University
>>>>> Nanjing, P.R.China, 210093
>>>>>
>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
>



-- 
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang
