From: Weiwei Wang
To: java-user@lucene.apache.org
Date: Sun, 13 Dec 2009 18:42:36 +0800
Subject: Re: Recover special terms from StandardTokenizer

That problem is solved, but now I have another one. I want to use Highlighter in my system, and the token offsets are incorrect once MappingCharFilter is used. Koji, do you know how to fix the offset problem?

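To make the offset question concrete: a filter that rewrites characters before tokenization shifts positions, so offsets measured on the filtered text have to be mapped back before they can be used to highlight the original text. The sketch below shows that idea in plain Java with no Lucene dependency. The class name `OffsetCorrectionSketch` and its methods are hypothetical, chosen only for illustration, and are not Lucene's implementation. It handles a single mapping and only corrects offsets that fall on token boundaries outside a replaced span.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Lucene-free illustration of offset correction after a
// character mapping such as "c++" -> "cplusplus".
final class OffsetCorrectionSketch {
    private final StringBuilder filtered = new StringBuilder();
    // Each entry is {offset in the filtered text, cumulative difference to add}.
    private final List<int[]> corrections = new ArrayList<>();

    OffsetCorrectionSketch(String original, String match, String replacement) {
        if (match.isEmpty()) {
            throw new IllegalArgumentException("match must not be empty");
        }
        int cumulativeDiff = 0;
        int i = 0;
        while (i < original.length()) {
            if (original.startsWith(match, i)) {
                filtered.append(replacement);
                // The filtered text grew (or shrank) relative to the original here.
                cumulativeDiff += match.length() - replacement.length();
                corrections.add(new int[] {filtered.length(), cumulativeDiff});
                i += match.length();
            } else {
                filtered.append(original.charAt(i));
                i++;
            }
        }
    }

    String filteredText() {
        return filtered.toString();
    }

    // Maps an offset in the filtered text back to the original text by applying
    // the last correction recorded at or before that offset.
    int correctOffset(int filteredOffset) {
        int diff = 0;
        for (int[] c : corrections) {
            if (c[0] > filteredOffset) {
                break;
            }
            diff = c[1];
        }
        return filteredOffset + diff;
    }

    public static void main(String[] args) {
        String original = "I like c++ and c++";
        OffsetCorrectionSketch sketch =
                new OffsetCorrectionSketch(original, "c++", "cplusplus");
        String text = sketch.filteredText();
        // Pretend a tokenizer reported the second "cplusplus" token.
        int start = text.lastIndexOf("cplusplus");
        int end = start + "cplusplus".length();
        int originalStart = sketch.correctOffset(start);
        int originalEnd = sketch.correctOffset(end);
        // The corrected offsets select the text a highlighter should mark.
        System.out.println(original.substring(originalStart, originalEnd)); // prints "c++"
    }
}
```

Without the correction step, the second token's offsets (21 to 30 in the filtered text) would point past the end of the 18-character original string. That is the kind of mismatch a highlighter would show. Whether the custom LowercaseCharFilter or the modified tokenizer in this thread passes such corrections through is not something the thread confirms.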
On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang wrote:

> I used Luke to check the result and found that only c exists as a term; no
> cplusplus was found in the index.
>
> On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang wrote:
>
>> Thanks, Koji. I followed your advice and changed my analyzer as shown
>> below:
>>
>> NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
>> RECOVERY_MAP.add("c++","cplusplus$");
>> CharFilter filter = new LowercaseCharFilter(reader);
>> filter = new MappingCharFilter(RECOVERY_MAP,filter);
>> StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
>> tokenStream.setMaxTokenLength(maxTokenLength);
>> TokenStream result = new StandardFilter(tokenStream);
>> result = new LowerCaseFilter(result);
>> result = new StopFilter(enableStopPositionIncrements, result, stopSet);
>> result = new SnowballFilter(result, STEMMER);
>>
>> I use the same analyzer on the search side. As you know, this analyzer
>> tokenizes c++ as cplusplus, so it seems I should be able to search for c++
>> with the same analyzer, because the query is also tokenized as cplusplus.
>>
>> I tested it on the string c++c++; however, when I search for c++ in the
>> built index, nothing is returned.
>>
>> I do not know what's wrong with my code. Waiting for your reply.
>>
>> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang wrote:
>>
>>> Thanks, Koji
>>>
>>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi wrote:
>>>
>>>> MappingCharFilter can be used to convert c++ to cplusplus.
>>>>
>>>> Koji
>>>>
>>>> --
>>>> http://www.rondhuit.com/en/
>>>>
>>>> Anshum wrote:
>>>>
>>>>> How about getting the original token stream and then converting c++ to
>>>>> cplusplus or any other such transform? Or perhaps you might look at
>>>>> using/extending (in the non-Java sense) some other tokenizer!
>>>>>
>>>>> --
>>>>> Anshum Gupta
>>>>> Naukri Labs!
>>>>> http://ai-cafe.blogspot.com
>>>>>
>>>>> The facts expressed here belong to everybody, the opinions to me. The
>>>>> distinction is yours to draw............
>>>>>
>>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang wrote:
>>>>>
>>>>>> Hi, all,
>>>>>> I designed an FTP search engine based on Lucene. I made a few
>>>>>> modifications to the StandardTokenizer.
>>>>>> My problem is:
>>>>>> C++ is tokenized as c by StandardTokenizer, and I want to recover it
>>>>>> from the TokenStream produced by StandardTokenizer.
>>>>>>
>>>>>> What should I do?
>>>>>>
>>>>>> --
>>>>>> Weiwei Wang
>>>>>> Alex Wang
>>>>>> 王巍巍
>>>>>> Room 403, Mengmin Wei Building
>>>>>> Computer Science Department
>>>>>> Gulou Campus of Nanjing University
>>>>>> Nanjing, P.R.China, 210093
>>>>>>
>>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang

--
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang