Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Reply-To: java-user@lucene.apache.org
Delivered-To: mailing list java-user@lucene.apache.org
From: "Uwe Schindler"
References: <7d94dcde0912102130u214d4a87r18b97575e223b521@mail.gmail.com>
 <867513fe0912110109i71d6491am10934c77190b2029@mail.gmail.com>
 <4B223416.2060701@r.email.ne.jp>
 <7d94dcde0912110543y63729d5eo14e6be2b992121b@mail.gmail.com>
 <7d94dcde0912121834l14d3b3acmff09dbcf5e7114d1@mail.gmail.com>
 <7d94dcde0912121912r4496b28cv261089aa1ba94f79@mail.gmail.com>
 <7d94dcde0912130242q38a50c5frcf8b0124fcaf4a91@mail.gmail.com>
 <8321DA8EE5DF498A838FC1696CB5E359@VEGA>
 <7d94dcde0912130322l73284a1ay3b8345c3eefdf10f@mail.gmail.com>
Subject: RE: Recover special terms from StandardTokenizer
Date: Sun, 13 Dec 2009 12:42:59 +0100
Message-ID: <3E2DC705894D4E199EED0A9F45BF8EA2@VEGA>
In-Reply-To: <7d94dcde0912130322l73284a1ay3b8345c3eefdf10f@mail.gmail.com>

I think your problem is the LowercaseCharFilter, which does not pass
correctOffset() on to the underlying CharFilter. Does it work better without
your LowercaseCharFilter (which is redundant, because there is already a
LowerCaseFilter in the Tokenizer chain)?

As you are only looking for "c++", just add a second mapping for "C++" and
you are done. Why lowercase everything because of one char?

And what's RosaMappingCharFilter? A pink one? *g*

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> Sent: Sunday, December 13, 2009 12:23 PM
> To: java-user@lucene.apache.org
> Subject: Re: Recover special terms from StandardTokenizer
>
> Thanks, Uwe.
> Maybe I was not very clear.
> My situation is like this:
> Analyzer:
> NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> RECOVERY_MAP.add("c++","cplusplus$");
> CharFilter filter = new LowercaseCharFilter(reader);
> filter = new RosaMappingCharFilter(RECOVERY_MAP,filter);
> StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
> tokenStream.setMaxTokenLength(maxTokenLength);
> TokenStream result = new StandardFilter(tokenStream);
> result = getStopFilter(result);
> result = new SnowballFilter(result, STEMMER);
>
> Analyzing c++c++ returns
> (0,9) [cplusplus]
> (10,19) [cplusplus]
> The two numbers in the brackets are offsets.
>
> So in the searching process, when I want to highlight the search keyword c++
> with the same analyzer, an exception is thrown, because the string I stored
> is c++c++, not cpluspluscplusplus (actually, I should not change the original
> string when storing it, otherwise it will confuse the users).
>
> I hope the analyzer can give a result like this:
> (0,3) [cplusplus]
> (3,6) [cplusplus]
> Then the Highlighter will work fine.
>
> So how can I achieve this result?
>
> 2009/12/13 Uwe Schindler
>
> > MappingCharFilter preserves the offsets in the stream *before* filtering.
> > So if you store the original string (without c++ replaced) in a stored
> > field, you can highlight using the given offsets. The highlighter must
> > again use the same analyzer, or you can use FastVectorHighlighter.
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> > > -----Original Message-----
> > > From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> > > Sent: Sunday, December 13, 2009 11:43 AM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: Recover special terms from StandardTokenizer
> > >
> > > Problem solved. Now another problem comes.
> > >
> > > As I want to use Highlighter in my system, the token offset is incorrect
> > > after the MappingCharFilter is used.
> > >
> > > Koji, do you know how to fix the offset problem?
> > >
> > > On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang wrote:
> > >
> > > > I used Luke to check the result and found that only c exists as a
> > > > term; no cplusplus is found in the index.
> > > >
> > > > On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang wrote:
> > > >
> > > >> Thanks, Koji. I followed your advice and changed my analyzer as shown
> > > >> below:
> > > >> NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> > > >> RECOVERY_MAP.add("c++","cplusplus$");
> > > >> CharFilter filter = new LowercaseCharFilter(reader);
> > > >> filter = new MappingCharFilter(RECOVERY_MAP,filter);
> > > >> StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
> > > >> tokenStream.setMaxTokenLength(maxTokenLength);
> > > >> TokenStream result = new StandardFilter(tokenStream);
> > > >> result = new LowerCaseFilter(result);
> > > >> result = new StopFilter(enableStopPositionIncrements, result, stopSet);
> > > >> result = new SnowballFilter(result, STEMMER);
> > > >>
> > > >> I use the same analyzer on the search side. As you know, this analyzer
> > > >> tokenizes c++ as cplusplus, so it seems I should be able to search for
> > > >> c++ with the same analyzer, because the query is also tokenized as
> > > >> cplusplus.
> > > >>
> > > >> I tested it on the string c++c++; however, when I search for c++ on
> > > >> the built index, nothing is returned.
> > > >>
> > > >> I do not know what's wrong with my code. Waiting for your reply.
> > > >>
> > > >> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang wrote:
> > > >>
> > > >>> Thanks, Koji
> > > >>>
> > > >>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi wrote:
> > > >>>
> > > >>>> MappingCharFilter can be used to convert c++ to cplusplus.
> > > >>>>
> > > >>>> Koji
> > > >>>>
> > > >>>> --
> > > >>>> http://www.rondhuit.com/en/
> > > >>>>
> > > >>>> Anshum wrote:
> > > >>>>
> > > >>>>> How about getting the original token stream and then converting c++
> > > >>>>> to cplusplus, or any other such transform? Or perhaps you might look
> > > >>>>> at using/extending (in the non-Java sense) some other tokenizer!
> > > >>>>>
> > > >>>>> --
> > > >>>>> Anshum Gupta
> > > >>>>> Naukri Labs!
> > > >>>>> http://ai-cafe.blogspot.com
> > > >>>>>
> > > >>>>> The facts expressed here belong to everybody, the opinions to me.
> > > >>>>> The distinction is yours to draw............
> > > >>>>>
> > > >>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang <ww.wang.cs@gmail.com>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> Hi, all,
> > > >>>>>> I designed an FTP search engine based on Lucene. I made a few
> > > >>>>>> modifications to the StandardTokenizer.
> > > >>>>>> My problem is:
> > > >>>>>> C++ is tokenized as c by StandardTokenizer, and I want to recover
> > > >>>>>> it from the TokenStream that StandardTokenizer produces.
> > > >>>>>>
> > > >>>>>> What should I do?
> > > >>>>>>
> > > >>>>>> --
> > > >>>>>> Weiwei Wang
> > > >>>>>> Alex Wang
> > > >>>>>> 王巍巍
> > > >>>>>> Room 403, Mengmin Wei Building
> > > >>>>>> Computer Science Department
> > > >>>>>> Gulou Campus of Nanjing University
> > > >>>>>> Nanjing, P.R.China, 210093
> > > >>>>>>
> > > >>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > >>>>
> > > >>>> ---------------------------------------------------------------------
> > > >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
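
For readers of the archive, the offset point made in this thread can be illustrated without Lucene. What follows is a minimal plain-Java sketch, not Lucene's MappingCharFilter and not code from any poster. The class OffsetMappingSketch, its methods analyze and recoveryMap, and the simple letter-or-digit tokenization are all assumptions made up for illustration. It shows the two ideas from the reply above: map both "c++" and "C++" instead of lowercasing the whole stream, and report token offsets against the original text so that a highlighter working on the stored string c++c++ gets (0,3) and (3,6).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration only: NOT Lucene's MappingCharFilter or StandardTokenizer. */
class OffsetMappingSketch {

    /** A term plus its start/end offsets in the ORIGINAL (unmapped) text. */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "(" + start + "," + end + ") [" + term + "]";
        }
    }

    /** Hypothetical map mirroring the thread: one entry per case variant. */
    static Map<String, String> recoveryMap() {
        Map<String, String> map = new HashMap<>();
        map.put("c++", "cplusplus$");
        map.put("C++", "cplusplus$");
        return map;
    }

    /** Applies the mappings, tokenizes the mapped text, corrects offsets. Keys must be non-empty. */
    static List<Token> analyze(String original, Map<String, String> mappings) {
        StringBuilder mapped = new StringBuilder();
        // For every char of the mapped text: the original span it came from.
        List<Integer> startOf = new ArrayList<>();
        List<Integer> endOf = new ArrayList<>();

        int i = 0;
        while (i < original.length()) {
            String match = null; // longest mapping key starting at position i
            for (String key : mappings.keySet()) {
                if (original.startsWith(key, i) && (match == null || key.length() > match.length())) {
                    match = key;
                }
            }
            String out = (match == null) ? String.valueOf(original.charAt(i)) : mappings.get(match);
            int spanLength = (match == null) ? 1 : match.length();
            for (int k = 0; k < out.length(); k++) {
                startOf.add(i);
                endOf.add(i + spanLength);
            }
            mapped.append(out);
            i += spanLength;
        }

        // Split the mapped text on anything that is not a letter or digit.
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        int n = mapped.length();
        while (pos < n) {
            while (pos < n && !Character.isLetterOrDigit(mapped.charAt(pos))) pos++;
            int begin = pos;
            while (pos < n && Character.isLetterOrDigit(mapped.charAt(pos))) pos++;
            if (pos > begin) {
                String term = mapped.substring(begin, pos).toLowerCase();
                // Translate mapped-text offsets back to original-text offsets.
                tokens.add(new Token(term, startOf.get(begin), endOf.get(pos - 1)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : analyze("c++c++", recoveryMap())) {
            System.out.println(t); // prints (0,3) [cplusplus] then (3,6) [cplusplus]
        }
    }
}
```

In this sketch, every filter in the chain has to carry the offset translation through. A wrapper that rewrites characters but reports uncorrected positions would bring back the (0,9) and (10,19) offsets quoted above.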