lucene-java-user mailing list archives

From Weiwei Wang <ww.wang...@gmail.com>
Subject Re: Recover special terms from StandardTokenizer
Date Sun, 13 Dec 2009 02:34:10 GMT
Thanks, Koji. I followed your advice and changed my analyzer as shown below:
NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
RECOVERY_MAP.add("c++", "cplusplus$");
CharFilter filter = new LowercaseCharFilter(reader);
filter = new MappingCharFilter(RECOVERY_MAP, filter);
StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
tokenStream.setMaxTokenLength(maxTokenLength);
TokenStream result = new StandardFilter(tokenStream);
result = new LowerCaseFilter(result);
result = new StopFilter(enableStopPositionIncrements, result, stopSet);
result = new SnowballFilter(result, STEMMER);

I use the same analyzer on the search side. As you know, this analyzer
tokenizes c++ as cplusplus, so I expected to be able to search for c++ with
the same analyzer, because the query is also tokenized as cplusplus.
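
To make the expected tokenization concrete, here is a minimal sketch in plain
Java that does not use Lucene at all. The class CharMapSketch and its methods
are hypothetical stand-ins: a literal string replacement plays the role of the
char filters, and a split on non-alphanumeric characters is only a crude
approximation of a tokenizer, so real StandardTokenizer output may differ.
Stop-word removal and stemming are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class CharMapSketch {
    // Hypothetical stand-in for NormalizeCharMap: literal source -> replacement.
    static final Map<String, String> RECOVERY_MAP = new LinkedHashMap<>();
    static {
        RECOVERY_MAP.put("c++", "cplusplus$");
    }

    // Lowercase the text, then apply each literal replacement.
    // This approximates running the two char filters before tokenization.
    static String applyCharFilters(String text) {
        String out = text.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : RECOVERY_MAP.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    // Crude tokenizer stand-in: split on any run of characters that are
    // neither letters nor digits, and drop empty pieces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : applyCharFilters(text).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("c++c++")); // index-side text
        System.out.println(tokenize("C++"));    // query-side text
    }
}
```

Under these assumptions the trailing "$" in the replacement acts as a token
boundary, so the indexed text and the query should end up with the same term.
Printing the terms the real analyzer produces on both sides would be the
equivalent check against Lucene itself.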

I tested it on the string c++c++; however, when I search for c++ on the built
index, nothing is returned.

I do not know what's wrong with my code. Waiting for your reply.




On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang <ww.wang.cs@gmail.com> wrote:

> Thanks, Koji
>
>
> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi <koji@r.email.ne.jp> wrote:
>
>> MappingCharFilter can be used to convert c++ to cplusplus.
>>
>> Koji
>>
>> --
>> http://www.rondhuit.com/en/
>>
>>
>>
>> Anshum wrote:
>>
>>> How about getting the original token stream and then converting c++ to
>>> cplusplus, or any other such transform? Or perhaps you might look at
>>> using/extending (in the non-Java sense) some other tokenizer!
>>>
>>> --
>>> Anshum Gupta
>>> Naukri Labs!
>>> http://ai-cafe.blogspot.com
>>>
>>> The facts expressed here belong to everybody, the opinions to me. The
>>> distinction is yours to draw............
>>>
>>>
>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang <ww.wang.cs@gmail.com>
>>> wrote:
>>>
>>>
>>>
>>>> Hi, all,
>>>>    I designed an FTP search engine based on Lucene. I made a few
>>>> modifications to the StandardTokenizer.
>>>> My problem is:
>>>>  C++ is tokenized as c by StandardTokenizer, and I want to recover it
>>>> from the TokenStream produced by StandardTokenizer.
>>>>
>>>> What should I do?
>>>>
>>>> --
>>>> Weiwei Wang
>>>> Alex Wang
>>>> 王巍巍
>>>> Room 403, Mengmin Wei Building
>>>> Computer Science Department
>>>> Gulou Campus of Nanjing University
>>>> Nanjing, P.R.China, 210093
>>>>
>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>



-- 
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang
