From: "Uwe Schindler"
Reply-To: java-user@lucene.apache.org
Subject: RE: Recover special terms from StandardTokenizer
Date: Sun, 13 Dec 2009 13:10:03 +0100

MappingCharFilter definitely preserves the offsets from the original reader.
You can verify that for your case with Lucene's test case
TestMappingCharFilter in the source distribution at
/src/test/org/apache/lucene/analysis/TestMappingCharFilter.java:

  public void test2to4() throws Exception {
    CharStream cs = new MappingCharFilter( normMap, new StringReader( "ll" ) );
    TokenStream ts = new WhitespaceTokenizer( cs );
    assertTokenStreamContents(ts, new String[]{"llll"}, new int[]{0}, new int[]{2});
  }

The token text is the mapped "llll", but the asserted start and end offsets
(0 and 2) refer to the original input "ll". So everything is correct there.
I also tried this test with StandardTokenizer instead of WhitespaceTokenizer:
it works and asserts the correct offsets.

You should debug through the incrementToken()/CharFilter calls and verify
where your offsets change. I cannot help more.
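To illustrate what "preserving offsets" means here, below is a minimal
plain-Java sketch of the idea. It is not Lucene code and does not reproduce
how MappingCharFilter is implemented: the class OffsetCorrectionSketch and its
methods filtered() and correctOffset() are hypothetical names made up for this
example, and it assumes a single replacement rule. It records, for every
position in the filtered text, the matching position in the original text, so
that token offsets computed on the filtered text can be translated back before
highlighting.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: this is NOT Lucene's MappingCharFilter.
final class OffsetCorrectionSketch {
    private final StringBuilder out = new StringBuilder();
    // originalOffsets.get(i) is the offset in the original text that
    // corresponds to position i of the filtered text.
    private final List<Integer> originalOffsets = new ArrayList<>();

    OffsetCorrectionSketch(String input, String from, String to) {
        if (from.isEmpty()) {
            throw new IllegalArgumentException("from must not be empty");
        }
        int in = 0;
        while (in < input.length()) {
            if (input.startsWith(from, in)) {
                // Emit the replacement; positions past the end of the matched
                // text are clamped to the end of the match in the original.
                for (int k = 0; k < to.length(); k++) {
                    out.append(to.charAt(k));
                    originalOffsets.add(in + Math.min(k, from.length()));
                }
                in += from.length();
            } else {
                // Character copied unchanged: position maps one to one.
                out.append(input.charAt(in));
                originalOffsets.add(in);
                in++;
            }
        }
        // End-of-text position, needed for the end offset of the last token.
        originalOffsets.add(input.length());
    }

    String filtered() {
        return out.toString();
    }

    int correctOffset(int filteredOffset) {
        return originalOffsets.get(filteredOffset);
    }

    public static void main(String[] args) {
        OffsetCorrectionSketch s =
            new OffsetCorrectionSketch("c++ c++", "c++", "cplusplus");
        int start = 0;
        // Naive whitespace split of the filtered text, for demonstration.
        for (String token : s.filtered().split(" ")) {
            int end = start + token.length();
            System.out.println("(" + s.correctOffset(start) + ","
                + s.correctOffset(end) + ") [" + token + "]");
            start = end + 1;
        }
        // prints:
        // (0,3) [cplusplus]
        // (4,7) [cplusplus]
    }
}
```

In the filtered text the two tokens sit at (0,9) and (10,19); after correction
they become (0,3) and (4,7), which are valid positions in the original string
"c++ c++" that would be kept in the stored field.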
Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> Sent: Sunday, December 13, 2009 12:51 PM
> To: java-user@lucene.apache.org
> Subject: Re: Recover special terms from StandardTokenizer
>
> LowercaseCharFilter is necessary, because the MappingCharFilter needs a
> NormalizeCharMap. We lowercase the stream so that we only have to provide
> lowercase mappings in the NormalizeCharMap, e.g. we provide the mapping
> (c++-->cplusplus) instead of both (c++-->cplusplus) and (C++-->cplusplus).
>
> C++ is only one example we want to fix; in the future we may add more such
> special terms.
>
> The code for LowercaseCharFilter is as follows:
>
> package analysis;
>
> import java.io.IOException;
> import java.io.Reader;
>
> import org.apache.lucene.analysis.BaseCharFilter;
> import org.apache.lucene.analysis.CharReader;
> import org.apache.lucene.analysis.CharStream;
>
> public class LowercaseCharFilter extends BaseCharFilter
> {
>     public LowercaseCharFilter(CharStream in)
>     {
>         super(in);
>     }
>
>     public LowercaseCharFilter(Reader in)
>     {
>         super(CharReader.get(in));
>     }
>
>     @Override
>     public int read() throws IOException
>     {
>         return Character.toLowerCase(input.read());
>     }
>
>     @Override
>     public int read(char[] cbuf, int off, int len) throws IOException {
>         int ret = input.read(cbuf, off, len);
>         if(ret!=-1)
>         {
>             // lowercase only the ret characters that were actually read
>             for(int i=off; i<off+ret; i++)
>                 cbuf[i] = Character.toLowerCase(cbuf[i]);
>         }
>         return ret;
>     }
> }
>
> Currently RosaMappingCharFilter inherits from MappingCharFilter and changes
> nothing (I was planning to override addOffCorrectMap to fix my problem, but
> it didn't work).
>
> 2009/12/13 Uwe Schindler
>
> > I think your problem is the LowercaseCharFilter, which does not pass
> > correctOffset() to the underlying CharFilter.
> > Does it work better without your LowercaseCharFilter (which is redundant,
> > because there is already a LowerCaseFilter in the Tokenizer chain)?
> >
> > As you are only looking for "c++", just add a mapping for "C++" as well
> > and you are done. Why lowercase everything because of one char?
> >
> > And what's RosaMappingCharFilter? A pink one? *g*
> >
> > > -----Original Message-----
> > > From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> > > Sent: Sunday, December 13, 2009 12:23 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: Recover special terms from StandardTokenizer
> > >
> > > Thanks, Uwe.
> > > Maybe I was not very clear. My situation is like this:
> > >
> > > Analyzer:
> > >   NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> > >   RECOVERY_MAP.add("c++","cplusplus$");
> > >   CharFilter filter = new LowercaseCharFilter(reader);
> > >   filter = new RosaMappingCharFilter(RECOVERY_MAP,filter);
> > >   StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
> > >   tokenStream.setMaxTokenLength(maxTokenLength);
> > >   TokenStream result = new StandardFilter(tokenStream);
> > >   result = getStopFilter(result);
> > >   result = new SnowballFilter(result, STEMMER);
> > >
> > > Analyzing c++c++ returns:
> > >   (0,9) [cplusplus]
> > >   (10,19) [cplusplus]
> > > The two numbers in the brackets are offsets.
> > >
> > > So when I want to highlight the search keyword c++ with the same analyzer
> > > during searching, an exception will be thrown, because the string I
> > > stored is c++c++, not cpluspluscplusplus (I should not change the
> > > original string when storing it, otherwise it will confuse the users).
> > >
> > > I hope the analyzer can give a result like this:
> > >   (0,3) [cplusplus]
> > >   (3,6) [cplusplus]
> > > Then the Highlighter will work fine.
> > >
> > > So how can I achieve this result?
> > >
> > > 2009/12/13 Uwe Schindler
> > >
> > > > MappingCharFilter preserves the offsets in the stream *before* filtering.
> > > > So if you store the original string (without c++ replaced) in a stored
> > > > field, you can highlight using the given offsets. The highlighter must
> > > > again use the same analyzer, or you can use FastVectorHighlighter.
> > > >
> > > > > -----Original Message-----
> > > > > From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> > > > > Sent: Sunday, December 13, 2009 11:43 AM
> > > > > To: java-user@lucene.apache.org
> > > > > Subject: Re: Recover special terms from StandardTokenizer
> > > > >
> > > > > Problem solved. Now another problem comes up.
> > > > >
> > > > > As I want to use Highlighter in my system, the token offset is
> > > > > incorrect after the MappingCharFilter is used.
> > > > >
> > > > > Koji, do you know how to fix the offset problem?
> > > > >
> > > > > On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang wrote:
> > > > >
> > > > > > I used Luke to check the result and found that only c exists as a
> > > > > > term; no cplusplus is found in the index.
> > > > > >
> > > > > > On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang wrote:
> > > > > >
> > > > > >> Thanks, Koji. I followed your advice and changed my analyzer as
> > > > > >> shown below:
> > > > > >>   NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> > > > > >>   RECOVERY_MAP.add("c++","cplusplus$");
> > > > > >>   CharFilter filter = new LowercaseCharFilter(reader);
> > > > > >>   filter = new MappingCharFilter(RECOVERY_MAP,filter);
> > > > > >>   StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
> > > > > >>   tokenStream.setMaxTokenLength(maxTokenLength);
> > > > > >>   TokenStream result = new StandardFilter(tokenStream);
> > > > > >>   result = new LowerCaseFilter(result);
> > > > > >>   result = new StopFilter(enableStopPositionIncrements, result, stopSet);
> > > > > >>   result = new SnowballFilter(result, STEMMER);
> > > > > >>
> > > > > >> I use the same analyzer on the search side. Since this analyzer
> > > > > >> tokenizes c++ as cplusplus, it seems I should be able to search for
> > > > > >> c++ with the same analyzer, because the query is also tokenized as
> > > > > >> cplusplus.
> > > > > >>
> > > > > >> I tested it on the string c++c++; however, when I search for c++ on
> > > > > >> the built index, nothing is returned.
> > > > > >>
> > > > > >> I do not know what's wrong with my code.
> > > > > >> Waiting for your reply.
> > > > > >>
> > > > > >> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang wrote:
> > > > > >>
> > > > > >>> Thanks, Koji
> > > > > >>>
> > > > > >>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi wrote:
> > > > > >>>
> > > > > >>>> MappingCharFilter can be used to convert c++ to cplusplus.
> > > > > >>>>
> > > > > >>>> Koji
> > > > > >>>>
> > > > > >>>> --
> > > > > >>>> http://www.rondhuit.com/en/
> > > > > >>>>
> > > > > >>>> Anshum wrote:
> > > > > >>>>
> > > > > >>>>> How about getting the original token stream and then converting c++
> > > > > >>>>> to cplusplus, or any other such transform? Or perhaps you might look
> > > > > >>>>> at using/extending (in the non-Java sense) some other tokenizer!
> > > > > >>>>>
> > > > > >>>>> --
> > > > > >>>>> Anshum Gupta
> > > > > >>>>> Naukri Labs!
> > > > > >>>>> http://ai-cafe.blogspot.com
> > > > > >>>>>
> > > > > >>>>> The facts expressed here belong to everybody, the opinions to me.
> > > > > >>>>> The distinction is yours to draw............
> > > > > >>>>>
> > > > > >>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang <ww.wang.cs@gmail.com> wrote:
> > > > > >>>>>
> > > > > >>>>>> Hi, all,
> > > > > >>>>>> I designed an FTP search engine based on Lucene. I made a few
> > > > > >>>>>> modifications to the StandardTokenizer.
> > > > > >>>>>> My problem is:
> > > > > >>>>>> C++ is tokenized as c by StandardTokenizer, and I want to recover
> > > > > >>>>>> it from the TokenStream produced by StandardTokenizer.
> > > > > >>>>>>
> > > > > >>>>>> What should I do?
> > > > > >>>>>>
> > > > > >>>>>> --
> > > > > >>>>>> Weiwei Wang
> > > > > >>>>>> Alex Wang
> > > > > >>>>>> 王巍巍
> > > > > >>>>>> Room 403, Mengmin Wei Building
> > > > > >>>>>> Computer Science Department
> > > > > >>>>>> Gulou Campus of Nanjing University
> > > > > >>>>>> Nanjing, P.R.China, 210093
> > > > > >>>>>>
> > > > > >>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org