Date: Thu, 1 Jul 2010 12:30:21 +0300
Subject: Re: Lucene and Chinese language
From: Danil ŢORIN
To: java-user@lucene.apache.org
In-Reply-To: <137B4E3B8F30074EADD080AA8A6E2A9312950D4AE4@ZDE070.lenze.com>

Try using the CJK analyzer for both indexing and searching Chinese text.
Then you won't need the "text" -> "*text*" transformation.
There might be some false positives in the results, though.

You may also want to try the smartcn analyzer, which is dictionary-based,
but I have no expertise to evaluate its results (we still use CJK for Asian
languages, as there have been no complaints so far).

2010/7/1 Kolhoff, Jacqueline - ENCOWAY

> Hi!
>
> We are using lucene in our project to search through information objects
> which works fine. For indexing we use the StandardAnalyzer.
> Now, we have to support the Chinese language. I found out that the Chinese
> words and letters are correctly saved in the index but the query to search
> for them does not work. Example: in English language the query is “text”
> which we parse to “*text*”. If we search for Chinese words / phrases like
> “佛山东方书城” the query is “*佛山东方书城*” but there are no search results. If the
> query places blanks between the single letters / symbols like this
> “*佛 山 东 方 书 城*” we are getting results. Does the StandardAnalyzer
> interpret each Chinese letter as one word? What are best practices for this
> case? Shall we use another analyzer (Chinese analyzer)? Or is it better to
> replace the query parser in this case?
>
> Regards,
> Jacqueline.
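
For illustration, here is a minimal sketch of why bigram tokenization (the approach the CJK analyzer takes for Chinese text) makes the "*text*" wildcard wrapping unnecessary, and where the false positives mentioned in the reply can come from. It does not use Lucene at all: the class CjkBigramSketch and its methods bigrams and matches are hypothetical names made up for this example, and matches is a deliberate simplification that ignores token positions, which a real phrase query would check.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Hypothetical, Lucene-free illustration of overlapping-bigram tokenization.
final class CjkBigramSketch {

    // Split text into overlapping two-character tokens.
    // A single character is returned as one token; empty text yields no tokens.
    static List<String> bigrams(String text) {
        int[] cps = text.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        if (cps.length == 1) {
            tokens.add(new String(cps, 0, 1));
            return tokens;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            tokens.add(new String(cps, i, 2));
        }
        return tokens;
    }

    // Simplified match: every bigram of the query must occur among the
    // bigrams of the indexed text (token order and position are ignored).
    static boolean matches(String indexedText, String query) {
        return new HashSet<>(bigrams(indexedText)).containsAll(bigrams(query));
    }

    public static void main(String[] args) {
        String indexed = "佛山东方书城";

        // Tokens are 佛山, 山东, 东方, 方书, 书城
        System.out.println(bigrams(indexed));

        // A substring query matches without any wildcards: prints true
        System.out.println(matches(indexed, "东方书城"));

        // False positive: 山东 (Shandong) spans the boundary between
        // 佛山 and 东方, so it matches as well: prints true
        System.out.println(matches(indexed, "山东"));
    }
}
```

In Lucene itself, the suggestion in the reply amounts to handing the same CJKAnalyzer (package org.apache.lucene.analysis.cjk) to both the index writer and the query parser; the exact constructor arguments depend on the Lucene version in use, so check the Javadoc for your release.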