In-Reply-To: <8321DA8EE5DF498A838FC1696CB5E359@VEGA>
Date: Sun, 13 Dec 2009 19:22:32 +0800
Message-ID: <7d94dcde0912130322l73284a1ay3b8345c3eefdf10f@mail.gmail.com>
Subject: Re: Recover special terms from StandardTokenizer
From: Weiwei Wang
To: java-user@lucene.apache.org

Thanks, Uwe.

Maybe I was not very clear. My situation is like this:

Analyzer:

NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
RECOVERY_MAP.add("c++","cplusplus$");
CharFilter filter = new LowercaseCharFilter(reader);
filter = new RosaMappingCharFilter(RECOVERY_MAP,filter);
StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
tokenStream.setMaxTokenLength(maxTokenLength);
TokenStream result = new StandardFilter(tokenStream);
result = getStopFilter(result);
result = new SnowballFilter(result, STEMMER);

Analyzing c++c++ returns
(0,9) [cplusplus]
(10,19) [cplusplus]
The two numbers in the brackets are the start and end offsets of each token.
So at search time, when I try to highlight the search keyword c++ with the
same analyzer, an exception is thrown, because the string I stored is c++c++,
not cpluspluscplusplus. (I should not change the original string when storing
it; otherwise it would confuse the users.)

I hope the analyzer can give a result like this:
(0,3) [cplusplus]
(3,6) [cplusplus]
Then the Highlighter will work fine.

So how can I achieve this result?

2009/12/13 Uwe Schindler

> MappingCharFilter preserves the offsets in the stream *before* filtering. So
> if you store the original string (without c++ replaced) in a stored field,
> you can highlight using the given offsets. The highlighter must again use
> the same analyzer, or you can use FastVectorHighlighter.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> > -----Original Message-----
> > From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> > Sent: Sunday, December 13, 2009 11:43 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: Recover special terms from StandardTokenizer
> >
> > Problem solved. Now another problem has come up.
> >
> > I want to use Highlighter in my system, but the token offsets are
> > incorrect after the MappingCharFilter is used.
> >
> > Koji, do you know how to fix the offset problem?
> >
> > On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang wrote:
> >
> > > I checked the result with Luke and found that only c exists as a term;
> > > there is no cplusplus in the index.
> > >
> > > On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang wrote:
> > >
> > >> Thanks, Koji. I followed your advice and changed my analyzer as shown
> > >> below:
> > >>
> > >> NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> > >> RECOVERY_MAP.add("c++","cplusplus$");
> > >> CharFilter filter = new LowercaseCharFilter(reader);
> > >> filter = new MappingCharFilter(RECOVERY_MAP,filter);
> > >> StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
> > >> tokenStream.setMaxTokenLength(maxTokenLength);
> > >> TokenStream result = new StandardFilter(tokenStream);
> > >> result = new LowerCaseFilter(result);
> > >> result = new StopFilter(enableStopPositionIncrements, result, stopSet);
> > >> result = new SnowballFilter(result, STEMMER);
> > >>
> > >> I use the same analyzer on the search side. Since this analyzer
> > >> tokenizes c++ as cplusplus, it seems I should be able to search for
> > >> c++ with the same analyzer, because the query is also tokenized as
> > >> cplusplus.
> > >>
> > >> I tested it on the string c++c++. However, when I search for c++ on
> > >> the built index, nothing is returned.
> > >>
> > >> I do not know what is wrong with my code. Waiting for your reply.
> > >>
> > >> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang wrote:
> > >>
> > >>> Thanks, Koji
> > >>>
> > >>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi wrote:
> > >>>
> > >>>> MappingCharFilter can be used to convert c++ to cplusplus.
> > >>>>
> > >>>> Koji
> > >>>>
> > >>>> --
> > >>>> http://www.rondhuit.com/en/
> > >>>>
> > >>>> Anshum wrote:
> > >>>>
> > >>>>> How about getting the original token stream and then converting
> > >>>>> c++ to cplusplus, or any other such transform? Or perhaps you
> > >>>>> might look at using/extending (in the non-Java sense) some other
> > >>>>> tokenizer!
> > >>>>>
> > >>>>> --
> > >>>>> Anshum Gupta
> > >>>>> Naukri Labs!
> > >>>>> http://ai-cafe.blogspot.com
> > >>>>>
> > >>>>> The facts expressed here belong to everybody, the opinions to me.
> > >>>>> The distinction is yours to draw............
> > >>>>>
> > >>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang <ww.wang.cs@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi, all,
> > >>>>>> I designed an FTP search engine based on Lucene. I made a few
> > >>>>>> modifications to the StandardTokenizer.
> > >>>>>> My problem is:
> > >>>>>> C++ is tokenized as c by StandardTokenizer, and I want to recover
> > >>>>>> it from the TokenStream produced by StandardTokenizer.
> > >>>>>>
> > >>>>>> What should I do?
> > >>>>>>
> > >>>>>> --
> > >>>>>> Weiwei Wang
> > >>>>>> Alex Wang
> > >>>>>> 王巍巍
> > >>>>>> Room 403, Mengmin Wei Building
> > >>>>>> Computer Science Department
> > >>>>>> Gulou Campus of Nanjing University
> > >>>>>> Nanjing, P.R.China, 210093
> > >>>>>>
> > >>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > >>>>>>
> > >>>>
> > >>>> ---------------------------------------------------------------------
> > >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> > >>>>

-- 
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang
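
Uwe's point in the thread above, that offsets should refer to the text *before* filtering, can be illustrated without Lucene. The sketch below is plain Java and is not Lucene's MappingCharFilter: the class OffsetCorrectionSketch, its map/tokenize/highlight helpers, and the letter-or-digit tokenizer are all hypothetical stand-ins. It assumes a single c++ -> cplusplus$ mapping, keeps a table from each offset in the filtered text back to the original text, and uses that table to turn (0,9) and (10,19) into (0,3) and (3,6) before highlighting the stored string c++c++.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of offset correction; hypothetical, not Lucene code. */
final class OffsetCorrectionSketch {

    /** Filtered text plus, for every boundary in it, the matching offset in the original. */
    static final class Mapped {
        final String text;
        final int[] originalOffset; // text.length() + 1 entries

        Mapped(String text, int[] originalOffset) {
            this.text = text;
            this.originalOffset = originalOffset;
        }

        int correct(int filteredOffset) {
            return originalOffset[filteredOffset];
        }
    }

    /** Replaces every occurrence of from with to, recording original offsets. */
    static Mapped map(String original, String from, String to) {
        StringBuilder out = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        int i = 0;
        while (i < original.length()) {
            if (!from.isEmpty() && original.startsWith(from, i)) {
                // Assumption: replacement characters beyond the match length
                // are clamped to the end of the matched span in the original.
                for (int k = 0; k < to.length(); k++) {
                    offsets.add(i + Math.min(k, from.length()));
                }
                out.append(to);
                i += from.length();
            } else {
                offsets.add(i);
                out.append(original.charAt(i));
                i++;
            }
        }
        offsets.add(original.length()); // boundary after the last character
        int[] table = new int[offsets.size()];
        for (int k = 0; k < table.length; k++) {
            table[k] = offsets.get(k);
        }
        return new Mapped(out.toString(), table);
    }

    /** Stand-in tokenizer: runs of letters or digits as {start, end} in the filtered text. */
    static List<int[]> tokenize(String filtered) {
        List<int[]> tokens = new ArrayList<>();
        int start = -1;
        for (int p = 0; p <= filtered.length(); p++) {
            boolean inToken = p < filtered.length()
                    && Character.isLetterOrDigit(filtered.charAt(p));
            if (inToken && start < 0) {
                start = p;
            } else if (!inToken && start >= 0) {
                tokens.add(new int[] {start, p});
                start = -1;
            }
        }
        return tokens;
    }

    /** Wraps every token equal to term in b tags, using corrected offsets. */
    static String highlight(String original, Mapped mapped, String term) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (int[] t : tokenize(mapped.text)) {
            if (!mapped.text.substring(t[0], t[1]).equals(term)) {
                continue;
            }
            int s = mapped.correct(t[0]);
            int e = mapped.correct(t[1]);
            sb.append(original, last, s).append("<b>").append(original, s, e).append("</b>");
            last = e;
        }
        return sb.append(original.substring(last)).toString();
    }

    public static void main(String[] args) {
        String original = "c++c++";
        Mapped mapped = map(original, "c++", "cplusplus$");
        for (int[] t : tokenize(mapped.text)) {
            System.out.println("(" + t[0] + "," + t[1] + ") -> ("
                    + mapped.correct(t[0]) + "," + mapped.correct(t[1]) + ")");
        }
        System.out.println(highlight(original, mapped, "cplusplus"));
    }
}
```

Run as is, main prints each token's offsets in the filtered text next to the corrected ones, then the highlighted original string, <b>c++</b><b>c++</b>. The clamping rule inside map is only one possible choice for a replacement longer than the matched text; a real char filter may record its corrections differently.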