From: Robert Muir
Date: Tue, 15 Dec 2009 22:53:20 -0500
Message-ID: <8f0ad1f30912151953y17c175e7n20ff02428e5eacad@mail.gmail.com>
Subject: Re: How to do alias(Pinyin) search in Lucene
To: java-user@lucene.apache.org

Hi, just one more thought for you. Even more important than anything I said
before: make sure you implement reusableTokenStream in your analyzer. This
becomes a necessity when you are using expensive objects like this.

2009/12/15 Weiwei Wang

> Finally, I got it to run; however, it runs very slowly.
>
> 2009/12/15 Weiwei Wang
>
> > Got it, thanks, Robert.
> >
> > On Tue, Dec 15, 2009 at 10:19 PM, Robert Muir wrote:
> >
> >> If you have the Lucene 2.9 or 3.0 source code, just run
> >>     patch -p0 < /path/to/LUCENE-XXYY.patch
> >> from the Lucene source code root directory. It should create the
> >> necessary directory and files.
> >> Then run 'ant'; in this case it should create a lucene-icu jar file in
> >> the build directory.
> >>
> >> The patch doesn't include the ICU dependency itself, so you need to get
> >> that jar file from www.icu-project.org and have it on your classpath too.
> >>
> >> Sorry for the trouble; I hope to integrate some of this soon for a
> >> future release.
> >>
> >> On Tue, Dec 15, 2009 at 9:13 AM, Weiwei Wang wrote:
> >>
> >> > Yes, I found the patch file LUCENE-1488.patch, and there's no icu
> >> > directory in my downloaded contrib directory.
> >> >
> >> > I'm new to using patch. I'm currently in the contrib dir; could
> >> > anybody tell me how to run the patch command to generate the
> >> > relevant directory and source files?
> >> >
> >> > On Tue, Dec 15, 2009 at 9:51 PM, Robert Muir wrote:
> >> >
> >> > > Look at the latest patch file attached to the issue; it should work
> >> > > with Lucene 2.9 or greater (I think).
> >> > >
> >> > > 2009/12/15 Weiwei Wang
> >> > >
> >> > > > Where can I find the source code?
> >> > > >
> >> > > > On Tue, Dec 15, 2009 at 9:40 PM, Robert Muir wrote:
> >> > > >
> >> > > > > There is an ICU transform tokenfilter in the patch here:
> >> > > > > http://issues.apache.org/jira/browse/LUCENE-1488
> >> > > > >
> >> > > > >     // "Han-Latin" transliterates Han characters to pinyin.
> >> > > > >     Transliterator pinyin = Transliterator.getInstance("Han-Latin");
> >> > > > >     Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
> >> > > > >     ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
> >> > > > >     assertTokenStreamContents(filter, new String[] { "zhōng guó" });
> >> > > > >
> >> > > > > Note that by default it adds tone marks and inserts a space between
> >> > > > > syllables. If you do not want this, you need to do some cleanup.
> >> > > > >
> >> > > > >     // Decompose (NFD), then remove the tone marks and the spaces.
> >> > > > >     Transliterator pinyin = Transliterator.getInstance(
> >> > > > >         "Han-Latin; NFD; [[:NonspacingMark:][:Space:]] Remove");
> >> > > > >     Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
> >> > > > >     ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
> >> > > > >     assertTokenStreamContents(filter, new String[] { "zhongguo" });
> >> > > > >
> >> > > > > 2009/12/15 Weiwei Wang
> >> > > > >
> >> > > > > > Hi, guys,
> >> > > > > > I'm implementing a search engine based on Lucene for Chinese, so
> >> > > > > > I want to support pinyin search as Google China does.
> >> > > > > >
> >> > > > > > e.g.
> >> > > > > > “中国” means "China" in English.
> >> > > > > > This word's pinyin input is "zhongguo".
> >> > > > > > The feature I want to implement: when a user types zhongguo, the
> >> > > > > > results will include documents containing "中国" or even Chinese.
> >> > > > > >
> >> > > > > > Does anybody here know how to achieve this?
> >> > > > > >
> >> > > > > > --
> >> > > > > > Weiwei Wang
> >> > > > > > Alex Wang
> >> > > > > > 王巍巍
> >> > > > > > Room 403, Mengmin Wei Building
> >> > > > > > Computer Science Department
> >> > > > > > Gulou Campus of Nanjing University
> >> > > > > > Nanjing, P.R.China, 210093
> >> > > > > >
> >> > > > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang

--
Robert Muir
rcmuir@gmail.com
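
The cleanup rule quoted in the thread ("NFD; [[:NonspacingMark:][:Space:]] Remove") decomposes accented letters and then deletes the tone marks and spaces. As a rough illustration of that step alone, here is a minimal sketch using only the JDK. It does not do the Han-to-Latin transliteration, which needs ICU, and the class and method names are hypothetical. The regular expression is an approximation of the ICU set, assuming `\p{Mn}` and `\p{IsWhite_Space}` are close enough to `[:NonspacingMark:]` and `[:Space:]` for pinyin text.

```java
import java.text.Normalizer;

// Hypothetical helper: approximates only the "NFD; ... Remove" part of the
// ICU rule from the thread, using the JDK instead of ICU.
final class PinyinCleanup {

    private PinyinCleanup() {}

    // Decomposes accented letters (NFD), then removes the combining tone
    // marks and any whitespace between syllables.
    static String stripTonesAndSpaces(String tonedPinyin) {
        String decomposed = Normalizer.normalize(tonedPinyin, Normalizer.Form.NFD);
        return decomposed.replaceAll("[\\p{Mn}\\p{IsWhite_Space}]", "");
    }

    public static void main(String[] args) {
        // "zh\u014Dng gu\u00F3" is the toned form shown in the first example.
        System.out.println(stripTonesAndSpaces("zh\u014Dng gu\u00F3"));
    }
}
```

Applying the same cleanup at both index time and query time keeps a typed query such as zhongguo comparable with the indexed form.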