lucene-java-user mailing list archives

From Weiwei Wang <ww.wang...@gmail.com>
Subject Re: Recover special terms from StandardTokenizer
Date Sun, 13 Dec 2009 10:42:36 GMT
Problem solved, but now another problem has come up.


I want to use Highlighter in my system, but the token offsets are incorrect
once the MappingCharFilter is used.

Koji, do you know how to fix the offset problem?
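
To make the offset problem concrete: once "c++" has been rewritten to "cplusplus", a token's start and end positions refer to the rewritten text, so a highlighter that applies them to the original text marks the wrong span. The sketch below is plain Java with no Lucene dependency and is not Lucene's implementation. The class and method names (OffsetCorrectionSketch, map, correctOffset) are hypothetical, and it assumes a single fixed replacement. It records a cumulative length difference after each replacement and uses that to translate offsets in the mapped text back to the original.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of offset correction after a character mapping (not Lucene code). */
public class OffsetCorrectionSketch {

    /** The rewritten text plus the data needed to translate its offsets back. */
    public static final class Mapped {
        public final String text;
        // Each entry is {offsetInMappedText, cumulativeDiff}: from that mapped
        // offset onward, add cumulativeDiff to obtain the original offset.
        final List<int[]> corrections;

        Mapped(String text, List<int[]> corrections) {
            this.text = text;
            this.corrections = corrections;
        }
    }

    /** Replaces every occurrence of 'from' with 'to' and records offset corrections. */
    public static Mapped map(String input, String from, String to) {
        if (from.isEmpty()) {
            throw new IllegalArgumentException("'from' must not be empty");
        }
        StringBuilder out = new StringBuilder();
        List<int[]> corrections = new ArrayList<>();
        int cumulativeDiff = 0;
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith(from, i)) {
                out.append(to);
                i += from.length();
                cumulativeDiff += from.length() - to.length();
                // Offsets at or after the end of this replacement shift by cumulativeDiff.
                corrections.add(new int[] { out.length(), cumulativeDiff });
            } else {
                out.append(input.charAt(i));
                i++;
            }
        }
        return new Mapped(out.toString(), corrections);
    }

    /**
     * Translates an offset in the mapped text to an offset in the original text.
     * Offsets that fall strictly inside a replacement are not handled by this sketch.
     */
    public static int correctOffset(Mapped m, int mappedOffset) {
        int diff = 0;
        for (int[] c : m.corrections) {
            if (c[0] <= mappedOffset) {
                diff = c[1];
            } else {
                break;
            }
        }
        return mappedOffset + diff;
    }

    public static void main(String[] args) {
        String original = "I like c++ a lot";
        Mapped m = map(original, "c++", "cplusplus");
        int start = m.text.indexOf("cplusplus");
        int end = start + "cplusplus".length();
        // Uncorrected, [start, end) would overrun the three characters of "c++".
        System.out.println(original.substring(correctOffset(m, start), correctOffset(m, end)));
    }
}
```

Running main prints c++, the span a highlighter should mark in the original text. Whether a real analyzer chain reports corrected offsets depends on every char filter in the chain passing such corrections through, which may be worth checking for any custom filter placed in front of MappingCharFilter.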

On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang <ww.wang.cs@gmail.com> wrote:

> I used Luke to check the result and found that only c exists as a term; no
> cplusplus term is in the index.
>
>
> On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang <ww.wang.cs@gmail.com> wrote:
>
>> Thanks, Koji. I followed your advice and changed my analyzer as shown
>> below:
>> NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
>> RECOVERY_MAP.add("c++","cplusplus$");
>> CharFilter filter = new LowercaseCharFilter(reader);
>> filter = new MappingCharFilter(RECOVERY_MAP,filter);
>> StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30,
>> filter);
>> tokenStream.setMaxTokenLength(maxTokenLength);
>> TokenStream result = new StandardFilter(tokenStream);
>> result = new LowerCaseFilter(result);
>> result = new StopFilter(enableStopPositionIncrements, result, stopSet);
>> result = new SnowballFilter(result, STEMMER);
>>
>> I use the same analyzer on the search side. Since this analyzer tokenizes
>> c++ as cplusplus, I expected to be able to search for c++ with it, because
>> the query is also tokenized as cplusplus.
>>
>> I tested it on the string c++c++; however, when I search for c++ in the
>> built index, nothing is returned.
>>
>> I do not know what's wrong with my code. Waiting for your reply.
>>
>>
>>
>>
>>
>> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang <ww.wang.cs@gmail.com> wrote:
>>
>>> Thanks, Koji
>>>
>>>
>>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi <koji@r.email.ne.jp> wrote:
>>>
>>>> MappingCharFilter can be used to convert c++ to cplusplus.
>>>>
>>>> Koji
>>>>
>>>> --
>>>> http://www.rondhuit.com/en/
>>>>
>>>>
>>>>
>>>> Anshum wrote:
>>>>
>>>>> How about getting the original token stream and then converting c++ to
>>>>> cplusplus, or any other such transform? Or perhaps you might look at
>>>>> using/extending (in the non-Java sense) some other tokenizer!
>>>>>
>>>>> --
>>>>> Anshum Gupta
>>>>> Naukri Labs!
>>>>> http://ai-cafe.blogspot.com
>>>>>
>>>>> The facts expressed here belong to everybody, the opinions to me. The
>>>>> distinction is yours to draw............
>>>>>
>>>>>
>>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang <ww.wang.cs@gmail.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Hi, all,
>>>>>>    I designed an FTP search engine based on Lucene. I made a few
>>>>>> modifications to the StandardTokenizer.
>>>>>> My problem is:
>>>>>>  C++ is tokenized as c by StandardTokenizer, and I want to recover it
>>>>>> from the TokenStream produced by StandardTokenizer.
>>>>>>
>>>>>> What should I do?
>>>>>>
>>>>>> --
>>>>>> Weiwei Wang
>>>>>> Alex Wang
>>>>>> 王巍巍
>>>>>> Room 403, Mengmin Wei Building
>>>>>> Computer Science Department
>>>>>> Gulou Campus of Nanjing University
>>>>>> Nanjing, P.R.China, 210093
>>>>>>
>>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>



-- 
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang
