lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Recover special terms from StandardTokenizer
Date Sun, 13 Dec 2009 10:46:46 GMT
MappingCharFilter preserves the offsets in the stream *before* filtering. So
if you store the original string (without c++ replaced) in a stored field
you can highlight using the given offstes. The highlighter must use again
the same analyzer or use FastVectorHighlighter.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> Sent: Sunday, December 13, 2009 11:43 AM
> To: java-user@lucene.apache.org
> Subject: Re: Recover special terms from StandardTokenizer
> 
> Problem solved. Now another problem comes.
> 
> 
> As I want to use Highlighter in my system, the token offset is incorrect
> after the MappingCharFilter is used.
> 
> Koji, do you known how to fix the offset problem?
> 
> On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang <ww.wang.cs@gmail.com>
> wrote:
> 
> > I use Luke to check the result and find only c exists as a term, no
> > cplusplus found in the index
> >
> >
> > On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang
> <ww.wang.cs@gmail.com>wrote:
> >
> >> Thanks, Koji, I followed your advice and change my analyzer as shown
> >> below:
> >> NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> >> RECOVERY_MAP.add("c++","cplusplus$");
> >> CharFilter filter = new LowercaseCharFilter(reader);
> >> filter = new MappingCharFilter(RECOVERY_MAP,filter);
> >> StandardTokenizer tokenStream = new
> StandardTokenizer(Version.LUCENE_30,
> >> filter);
> >> tokenStream.setMaxTokenLength(maxTokenLength);
> >> TokenStream result = new StandardFilter(tokenStream);
> >> result = new LowerCaseFilter(result);
> >> result = new StopFilter(enableStopPositionIncrements, result, stopSet);
> >> result = new SnowballFilter(result, STEMMER);
> >>
> >> I use the same analyzer in the search side. As you know, this analyzer
> can
> >> token c++ as cplusplus, for this reason, it seems I can search c++ with
> >> the same analyzer because it is also tokenized as cplusplus.
> >>
> >> I tested it on as string c++c++, however, when i search c++ on the
> built
> >> index, nothing is returned.
> >>
> >>  I do not know what's wrong with my code. Waiting for your replay
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang
> <ww.wang.cs@gmail.com>wrote:
> >>
> >>> Thanks, Koji
> >>>
> >>>
> >>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi
> <koji@r.email.ne.jp>wrote:
> >>>
> >>>> MappingCharFilter can be used to convert c++ to cplusplus.
> >>>>
> >>>> Koji
> >>>>
> >>>> --
> >>>> http://www.rondhuit.com/en/
> >>>>
> >>>>
> >>>>
> >>>> Anshum wrote:
> >>>>
> >>>>> How about getting the original token stream and then converting
c++
> to
> >>>>> cplusplus or anyother such transform. Or perhaps you might look
at
> >>>>> using/extending(in the non java sense) some other tokenized!
> >>>>>
> >>>>> --
> >>>>> Anshum Gupta
> >>>>> Naukri Labs!
> >>>>> http://ai-cafe.blogspot.com
> >>>>>
> >>>>> The facts expressed here belong to everybody, the opinions to me.
> The
> >>>>> distinction is yours to draw............
> >>>>>
> >>>>>
> >>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang <ww.wang.cs@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>> Hi, all,
> >>>>>>    I designed a ftp search engine based on Lucene. I did a few
> >>>>>> modifications to the StandardTokenizer.
> >>>>>> My problem is:
> >>>>>>  C++ is tokenized as c from StandardTokenizer and I want to
recover
> it
> >>>>>> from
> >>>>>> the TokenStream from StandardTokenizer
> >>>>>>
> >>>>>> What should I do?
> >>>>>>
> >>>>>> --
> >>>>>> Weiwei Wang
> >>>>>> Alex Wang
> >>>>>> 王巍巍
> >>>>>> Room 403, Mengmin Wei Building
> >>>>>> Computer Science Department
> >>>>>> Gulou Campus of Nanjing University
> >>>>>> Nanjing, P.R.China, 210093
> >>>>>>
> >>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Weiwei Wang
> >>> Alex Wang
> >>> 王巍巍
> >>> Room 403, Mengmin Wei Building
> >>> Computer Science Department
> >>> Gulou Campus of Nanjing University
> >>> Nanjing, P.R.China, 210093
> >>>
> >>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> >>>
> >>
> >>
> >>
> >> --
> >> Weiwei Wang
> >> Alex Wang
> >> 王巍巍
> >> Room 403, Mengmin Wei Building
> >> Computer Science Department
> >> Gulou Campus of Nanjing University
> >> Nanjing, P.R.China, 210093
> >>
> >> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> >>
> >
> >
> >
> > --
> > Weiwei Wang
> > Alex Wang
> > 王巍巍
> > Room 403, Mengmin Wei Building
> > Computer Science Department
> > Gulou Campus of Nanjing University
> > Nanjing, P.R.China, 210093
> >
> > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> >
> 
> 
> 
> --
> Weiwei Wang
> Alex Wang
> 王巍巍
> Room 403, Mengmin Wei Building
> Computer Science Department
> Gulou Campus of Nanjing University
> Nanjing, P.R.China, 210093
> 
> Homepage: http://cs.nju.edu.cn/rl/weiweiwang


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message