lucene-java-user mailing list archives

From Weiwei Wang <ww.wang...@gmail.com>
Subject Re: Recover special terms from StandardTokenizer
Date Sun, 13 Dec 2009 11:22:32 GMT
Thanks, Uwe.
Maybe I was not very clear. My situation is like this:
Analyzer:
    NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
    RECOVERY_MAP.add("c++", "cplusplus$");
    CharFilter filter = new LowercaseCharFilter(reader);
    filter = new RosaMappingCharFilter(RECOVERY_MAP, filter);
    StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
    tokenStream.setMaxTokenLength(maxTokenLength);
    TokenStream result = new StandardFilter(tokenStream);
    result = getStopFilter(result);
    result = new SnowballFilter(result, STEMMER);
Analyzing c++c++ returns:
(0,9)  [cplusplus]
(10,19)  [cplusplus]
The two numbers in the brackets are offsets.

So during search, when I want to highlight the search keyword c++ with
the same analyzer, an exception is thrown, because the string I stored is
c++c++, not cpluspluscplusplus (and I should not change the original
string when storing it, otherwise it will confuse the users).
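
To make the mismatch concrete, here is a minimal sketch in plain Java with no
Lucene dependency. OffsetCheck and fits are hypothetical names used only for
illustration, and the bounds check is an assumption about why a highlighter
would reject these tokens; it is not copied from the Lucene source.

```java
class OffsetCheck {
    /** True when [start, end) is a valid span inside text. */
    static boolean fits(String text, int start, int end) {
        return 0 <= start && start <= end && end <= text.length();
    }

    public static void main(String[] args) {
        String stored = "c++c++";               // the original, unmodified field value
        int[][] reported = {{0, 9}, {10, 19}};  // offsets the analyzer currently returns
        int[][] wanted = {{0, 3}, {3, 6}};      // offsets that refer to the stored text

        // The reported offsets point past the end of the 6-character stored string.
        for (int[] o : reported) {
            System.out.println("(" + o[0] + "," + o[1] + ") fits: "
                    + fits(stored, o[0], o[1]));
        }
        // The wanted offsets each select one "c++" from the stored string.
        for (int[] o : wanted) {
            System.out.println("(" + o[0] + "," + o[1] + ") -> "
                    + stored.substring(o[0], o[1]));
        }
    }
}
```

Both reported spans fall outside the stored text, while both wanted spans
select a c++ from it, which is what highlighting needs.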

I hope the analyzer can give a result like this:
(0,3) [cplusplus]
(3,6) [cplusplus]
Then the Highlighter will work fine.

So how can I achieve this result?
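
For reference, the general idea of offset correction can be sketched in plain
Java as below. This is not Lucene's MappingCharFilter implementation: the
class OffsetCorrectingMapper and its methods are hypothetical, it handles a
single from/to pair only, and it stores one original offset per output
character, which is simple but wasteful. It only illustrates how offsets in
the replaced text can be mapped back to the original text.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetCorrectingMapper {
    private final String output;
    // origOffset[i] is the offset in the original text for output offset i.
    private final int[] origOffset;

    OffsetCorrectingMapper(String input, String from, String to) {
        if (from.isEmpty()) {
            throw new IllegalArgumentException("from must not be empty");
        }
        StringBuilder out = new StringBuilder();
        List<Integer> map = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith(from, i)) {
                // Characters beyond the original match length all map to the
                // end of the match, so a token end lands on the original end.
                for (int k = 0; k < to.length(); k++) {
                    out.append(to.charAt(k));
                    map.add(i + Math.min(k, from.length()));
                }
                i += from.length();
            } else {
                out.append(input.charAt(i));
                map.add(i);
                i++;
            }
        }
        map.add(input.length()); // sentinel for an end offset at end of output
        output = out.toString();
        origOffset = new int[map.size()];
        for (int j = 0; j < origOffset.length; j++) {
            origOffset[j] = map.get(j);
        }
    }

    String output() {
        return output;
    }

    /** Maps an offset in the replaced text back to the original text. */
    int correctOffset(int offset) {
        return origOffset[offset];
    }

    public static void main(String[] args) {
        OffsetCorrectingMapper m =
                new OffsetCorrectingMapper("c++c++", "c++", "cplusplus$");
        int[][] tokens = {{0, 9}, {10, 19}}; // token offsets in the replaced text
        for (int[] t : tokens) {
            System.out.println("(" + m.correctOffset(t[0]) + ","
                    + m.correctOffset(t[1]) + ")");
        }
    }
}
```

With the input c++c++ and the mapping c++ to cplusplus$, the token offsets
(0,9) and (10,19) in the replaced text map back to (0,3) and (3,6) in the
original string.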

2009/12/13 Uwe Schindler <uwe@thetaphi.de>

> MappingCharFilter preserves the offsets in the stream *before* filtering.
> So if you store the original string (without c++ replaced) in a stored
> field, you can highlight using the given offsets. The highlighter must
> again use the same analyzer, or you can use FastVectorHighlighter.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> > -----Original Message-----
> > From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> > Sent: Sunday, December 13, 2009 11:43 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: Recover special terms from StandardTokenizer
> >
> > Problem solved. Now another problem comes.
> >
> >
> > As I want to use Highlighter in my system, the token offset is incorrect
> > after the MappingCharFilter is used.
> >
> > Koji, do you know how to fix the offset problem?
> >
> > On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang <ww.wang.cs@gmail.com>
> > wrote:
> >
> > > I used Luke to check the result and found that only c exists as a term;
> > > no cplusplus is found in the index.
> > >
> > >
> > > On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang <ww.wang.cs@gmail.com> wrote:
> > >
> > >> Thanks, Koji. I followed your advice and changed my analyzer as shown
> > >> below:
> > >> NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> > >> RECOVERY_MAP.add("c++", "cplusplus$");
> > >> CharFilter filter = new LowercaseCharFilter(reader);
> > >> filter = new MappingCharFilter(RECOVERY_MAP, filter);
> > >> StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
> > >> tokenStream.setMaxTokenLength(maxTokenLength);
> > >> TokenStream result = new StandardFilter(tokenStream);
> > >> result = new LowerCaseFilter(result);
> > >> result = new StopFilter(enableStopPositionIncrements, result, stopSet);
> > >> result = new SnowballFilter(result, STEMMER);
> > >>
> > >> I use the same analyzer on the search side. As you know, this analyzer
> > >> tokenizes c++ as cplusplus, so it seems I should be able to search for
> > >> c++ with the same analyzer, because the query is also tokenized as
> > >> cplusplus.
> > >>
> > >> I tested it on the string c++c++; however, when I search for c++ on the
> > >> built index, nothing is returned.
> > >>
> > >> I do not know what's wrong with my code. Waiting for your reply.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang <ww.wang.cs@gmail.com> wrote:
> > >>
> > >>> Thanks, Koji
> > >>>
> > >>>
> > >>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi <koji@r.email.ne.jp> wrote:
> > >>>
> > >>>> MappingCharFilter can be used to convert c++ to cplusplus.
> > >>>>
> > >>>> Koji
> > >>>>
> > >>>> --
> > >>>> http://www.rondhuit.com/en/
> > >>>>
> > >>>>
> > >>>>
> > >>>> Anshum wrote:
> > >>>>
> > >>>>> How about getting the original token stream and then converting c++
> > >>>>> to cplusplus, or any other such transform? Or perhaps you might look
> > >>>>> at using/extending (in the non-Java sense) some other tokenizer!
> > >>>>>
> > >>>>> --
> > >>>>> Anshum Gupta
> > >>>>> Naukri Labs!
> > >>>>> http://ai-cafe.blogspot.com
> > >>>>>
> > >>>>> The facts expressed here belong to everybody, the opinions to me.
> > >>>>> The distinction is yours to draw............
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang <ww.wang.cs@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>> Hi, all,
> > >>>>>>    I designed an FTP search engine based on Lucene. I made a few
> > >>>>>> modifications to the StandardTokenizer.
> > >>>>>> My problem is:
> > >>>>>>  C++ is tokenized as c by StandardTokenizer, and I want to recover
> > >>>>>> it from the TokenStream produced by StandardTokenizer.
> > >>>>>>
> > >>>>>> What should I do?
> > >>>>>>
> > >>>>>> --
> > >>>>>> Weiwei Wang
> > >>>>>> Alex Wang
> > >>>>>> 王巍巍
> > >>>>>> Room 403, Mengmin Wei Building
> > >>>>>> Computer Science Department
> > >>>>>> Gulou Campus of Nanjing University
> > >>>>>> Nanjing, P.R.China, 210093
> > >>>>>>
> > >>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> ---------------------------------------------------------------------
> > >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > >>
> > >
> > >
> > >
> > >
> >
> >
> >
>
>
>
>


-- 
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang
