Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
From: "Uwe Schindler" <uwe@thetaphi.de>
To: <java-user@lucene.apache.org>
References: <7d94dcde0912102130u214d4a87r18b97575e223b521@mail.gmail.com>
 <867513fe0912110109i71d6491am10934c77190b2029@mail.gmail.com>
 <4B223416.2060701@r.email.ne.jp>
 <7d94dcde0912110543y63729d5eo14e6be2b992121b@mail.gmail.com>
 <7d94dcde0912121834l14d3b3acmff09dbcf5e7114d1@mail.gmail.com>
 <7d94dcde0912121912r4496b28cv261089aa1ba94f79@mail.gmail.com>
 <7d94dcde0912130242q38a50c5frcf8b0124fcaf4a91@mail.gmail.com>
 <8321DA8EE5DF498A838FC1696CB5E359@VEGA>
 <7d94dcde0912130322l73284a1ay3b8345c3eefdf10f@mail.gmail.com>
Subject: RE: Recover special terms from StandardTokenizer
Date: Sun, 13 Dec 2009 12:42:59 +0100
Message-ID: <3E2DC705894D4E199EED0A9F45BF8EA2@VEGA>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-2022-jp"
Content-Transfer-Encoding: 7bit
In-Reply-To: <7d94dcde0912130322l73284a1ay3b8345c3eefdf10f@mail.gmail.com>
Thread-Index: Acp75rJyWxLSbk6QSfy8rs2VfXTQNQAAjZWg

I think your problem is the LowercaseCharFilter, which does not pass
correctOffset() on to the underlying CharFilter. Does it work better without
your LowercaseCharFilter (which is redundant anyway, because there is already
a LowerCaseFilter in the Tokenizer chain)?

As you are only looking for "c++", just add a second mapping for "C++" and
you are done. Why lowercase everything because of one character?
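
To make the offset idea concrete, here is a small standalone sketch in plain
Java. It is not Lucene's MappingCharFilter and does not use any Lucene API;
the class OffsetMappingSketch and its methods are hypothetical names invented
for illustration. It replaces both "c++" and "C++" with "cplusplus$" and
remembers, for every offset in the filtered text, the matching offset in the
original text. The clamping rule for offsets inside a replacement is an
assumption of this sketch and may differ from what Lucene does internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Standalone illustration only; NOT Lucene's MappingCharFilter. */
public class OffsetMappingSketch {
    private final String filtered;                            // text after replacements
    private final List<Integer> original = new ArrayList<>(); // filtered offset -> input offset

    public OffsetMappingSketch(String input, Map<String, String> rules) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            // First matching rule wins (no longest-match logic in this sketch).
            String match = null;
            for (String key : rules.keySet()) {
                if (!key.isEmpty() && input.startsWith(key, i)) {
                    match = key;
                    break;
                }
            }
            if (match == null) {
                original.add(i);               // unchanged char keeps its offset
                out.append(input.charAt(i));
                i++;
            } else {
                String replacement = rules.get(match);
                for (int k = 0; k < replacement.length(); k++) {
                    // Assumed policy: clamp offsets to the end of the matched text.
                    original.add(i + Math.min(k, match.length()));
                }
                out.append(replacement);
                i += match.length();
            }
        }
        original.add(input.length());          // offset just past the last char
        filtered = out.toString();
    }

    public String getFiltered() {
        return filtered;
    }

    /** Maps an offset in the filtered text back to the original text. */
    public int correctOffset(int filteredOffset) {
        return original.get(filteredOffset);
    }

    public static void main(String[] args) {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("c++", "cplusplus$");
        rules.put("C++", "cplusplus$");        // second mapping instead of lowercasing
        OffsetMappingSketch m = new OffsetMappingSketch("C++c++", rules);
        System.out.println(m.getFiltered());   // cplusplus$cplusplus$
        // Token "cplusplus" sits at (0,9) and (10,19) in the filtered text.
        System.out.println(m.correctOffset(0) + "," + m.correctOffset(9));   // 0,3
        System.out.println(m.correctOffset(10) + "," + m.correctOffset(19)); // 3,6
    }
}
```

With corrected offsets of (0,3) and (3,6), a highlighter can work against the
stored original string "C++c++" rather than the filtered text.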

And what's RosaMappingCharFilter? A pink one? *g*

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> Sent: Sunday, December 13, 2009 12:23 PM
> To: java-user@lucene.apache.org
> Subject: Re: Recover special terms from StandardTokenizer
> 
> Thanks, Uwe.
> Maybe I was not very clear. My situation is like this:
> Analyzer:
>     NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
>     RECOVERY_MAP.add("c++","cplusplus$");
>     CharFilter filter = new LowercaseCharFilter(reader);
>     filter = new RosaMappingCharFilter(RECOVERY_MAP,filter);
>     StandardTokenizer tokenStream =
>         new StandardTokenizer(Version.LUCENE_30, filter);
>     tokenStream.setMaxTokenLength(maxTokenLength);
>     TokenStream result = new StandardFilter(tokenStream);
>     result = getStopFilter(result);
>     result = new SnowballFilter(result, STEMMER);
> Analyzing c++c++ returns
> (0,9)  [cplusplus]
> (10,19)  [cplusplus]
> where the two numbers in the brackets are the offsets.
> 
> So in the searching process, when I want to highlight the search keyword
> c++ with the same analyzer, an exception is thrown, because the string I
> stored is c++c++, not cpluspluscplusplus (I should not change the original
> string when storing it, otherwise it will confuse the users).
> 
> I would like the analyzer to give a result like this:
> (0,3) [cplusplus]
> (3,6) [cplusplus]
> Then the Highlighter will work fine.
> 
> So how can I achieve this result?
> 
> 2009/12/13 Uwe Schindler <uwe@thetaphi.de>
> 
> > MappingCharFilter preserves the offsets in the stream *before* filtering.
> > So if you store the original string (without c++ replaced) in a stored
> > field, you can highlight using the given offsets. The highlighter must
> > again use the same analyzer, or use FastVectorHighlighter.
> >
> > > -----Original Message-----
> > > From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> > > Sent: Sunday, December 13, 2009 11:43 AM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: Recover special terms from StandardTokenizer
> > >
> > > Problem solved. Now another problem comes.
> > >
> > >
> > > As I want to use Highlighter in my system, the token offset is incorrect
> > > after the MappingCharFilter is used.
> > >
> > > Koji, do you know how to fix the offset problem?
> > >
> > > On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang <ww.wang.cs@gmail.com>
> > > wrote:
> > >
> > > > I use Luke to check the result and find only c exists as a term, no
> > > > cplusplus found in the index
> > > >
> > > >
> > > > > On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang <ww.wang.cs@gmail.com>
> > > > > wrote:
> > > >
> > > > >> Thanks, Koji, I followed your advice and changed my analyzer as shown
> > > > >> below:
> > > > >> NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> > > > >> RECOVERY_MAP.add("c++","cplusplus$");
> > > > >> CharFilter filter = new LowercaseCharFilter(reader);
> > > > >> filter = new MappingCharFilter(RECOVERY_MAP,filter);
> > > > >> StandardTokenizer tokenStream =
> > > > >>     new StandardTokenizer(Version.LUCENE_30, filter);
> > > > >> tokenStream.setMaxTokenLength(maxTokenLength);
> > > > >> TokenStream result = new StandardFilter(tokenStream);
> > > > >> result = new LowerCaseFilter(result);
> > > > >> result = new StopFilter(enableStopPositionIncrements, result, stopSet);
> > > > >> result = new SnowballFilter(result, STEMMER);
> > > >>
> > > > >> I use the same analyzer on the search side. As you know, this analyzer
> > > > >> tokenizes c++ as cplusplus, so it seems I should be able to search for
> > > > >> c++ with the same analyzer, because the query is also tokenized as
> > > > >> cplusplus.
> > > > >>
> > > > >> I tested it on the string c++c++; however, when I search for c++ on the
> > > > >> built index, nothing is returned.
> > > > >>
> > > > >> I do not know what's wrong with my code. Waiting for your reply.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > > >> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang <ww.wang.cs@gmail.com>
> > > > >> wrote:
> > > >>
> > > >>> Thanks, Koji
> > > >>>
> > > >>>
> > > > >>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi <koji@r.email.ne.jp>
> > > > >>> wrote:
> > > >>>
> > > >>>> MappingCharFilter can be used to convert c++ to cplusplus.
> > > >>>>
> > > >>>> Koji
> > > >>>>
> > > >>>> --
> > > >>>> http://www.rondhuit.com/en/
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> Anshum wrote:
> > > >>>>
> > > > >>>>> How about getting the original token stream and then converting c++
> > > > >>>>> to cplusplus, or any other such transform? Or perhaps you might look
> > > > >>>>> at using/extending (in the non-Java sense) some other tokenizer!
> > > >>>>>
> > > >>>>> --
> > > >>>>> Anshum Gupta
> > > >>>>> Naukri Labs!
> > > >>>>> http://ai-cafe.blogspot.com
> > > >>>>>
> > > > >>>>> The facts expressed here belong to everybody, the opinions to me.
> > > > >>>>> The distinction is yours to draw............
> > > >>>>>
> > > >>>>>
> > > > >>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang <ww.wang.cs@gmail.com>
> > > > >>>>> wrote:
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > > >>>>>> Hi, all,
> > > > >>>>>>    I designed an FTP search engine based on Lucene. I made a few
> > > > >>>>>> modifications to the StandardTokenizer.
> > > > >>>>>> My problem is:
> > > > >>>>>>  C++ is tokenized as c by StandardTokenizer, and I want to recover
> > > > >>>>>> it from the TokenStream that StandardTokenizer produces.
> > > >>>>>>
> > > >>>>>> What should I do?
> > > >>>>>>
> > > >>>>>> --
> > > >>>>>> Weiwei Wang
> > > >>>>>> Alex Wang
> > > >>>>>> 王巍巍
> > > >>>>>> Room 403, Mengmin Wei Building
> > > >>>>>> Computer Science Department
> > > >>>>>> Gulou Campus of Nanjing University
> > > >>>>>> Nanjing, P.R.China, 210093
> > > >>>>>>
> > > >>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > > >>>> ---------------------------------------------------------------------
> > > >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >>>>
> > > >>>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Weiwei Wang
> > > >>> Alex Wang
> > > >>> 王巍巍
> > > >>> Room 403, Mengmin Wei Building
> > > >>> Computer Science Department
> > > >>> Gulou Campus of Nanjing University
> > > >>> Nanjing, P.R.China, 210093
> > > >>>
> > > >>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > >>>
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Weiwei Wang
> > > >> Alex Wang
> > > >> 王巍巍
> > > >> Room 403, Mengmin Wei Building
> > > >> Computer Science Department
> > > >> Gulou Campus of Nanjing University
> > > >> Nanjing, P.R.China, 210093
> > > >>
> > > >> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Weiwei Wang
> > > > Alex Wang
> > > > 王巍巍
> > > > Room 403, Mengmin Wei Building
> > > > Computer Science Department
> > > > Gulou Campus of Nanjing University
> > > > Nanjing, P.R.China, 210093
> > > >
> > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > >
> > >
> > >
> > >
> > > --
> > > Weiwei Wang
> > > Alex Wang
> > > 王巍巍
> > > Room 403, Mengmin Wei Building
> > > Computer Science Department
> > > Gulou Campus of Nanjing University
> > > Nanjing, P.R.China, 210093
> > >
> > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> >
> >
> >
> >
> 
> 
