Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <137B4E3B8F30074EADD080AA8A6E2A9312950D4AE4@ZDE070.lenze.com>
References: <137B4E3B8F30074EADD080AA8A6E2A9312950D4AE4@ZDE070.lenze.com>
Date: Thu, 1 Jul 2010 12:30:21 +0300
Message-ID: <AANLkTikcGfPdbfV3CcVBfK_8iVWC8hX8EC09PT9x51rl@mail.gmail.com>
Subject: Re: Lucene and Chinese language
From: Danil ŢORIN <torindan@gmail.com>
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=0016e64b03988ae102048a501e2b

--0016e64b03988ae102048a501e2b
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Try using the CJK analyzer (CJKAnalyzer) for both indexing and searching
Chinese text. It indexes overlapping two-character tokens, so you won't
need the "text" -> "*text*" transformation.

Because the tokens overlap, there might be some false positives in the
results. You may also want to try the smartcn analyzer, which is
dictionary-based, but I have no expertise to evaluate its results (we still
use CJK for Asian languages, as there have been no complaints so far).
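
To make the bigram idea concrete, here is a minimal sketch in plain Java.
It is not Lucene code and does not use the Lucene API: the class
BigramSketch and its methods are hypothetical names invented for this
illustration, and the real CJK analyzer does more (mixed scripts, token
positions, and so on). It only shows why the same tokenization on both
sides lets a plain query match, and where a false positive can come from.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; this is not how Lucene is implemented.
public class BigramSketch {

    // Split text into overlapping two-character tokens (bigrams).
    static List<String> bigrams(String text) {
        int[] cps = text.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < cps.length; i++) {
            tokens.add(new String(cps, i, 2));
        }
        return tokens;
    }

    // Phrase-style match: the query bigrams occur consecutively in the document.
    static boolean matchesPhrase(String doc, String query) {
        return Collections.indexOfSubList(bigrams(doc), bigrams(query)) >= 0;
    }

    // OR-style match: at least one query bigram occurs in the document.
    static boolean matchesAny(String doc, String query) {
        List<String> docTokens = bigrams(doc);
        for (String token : bigrams(query)) {
            if (docTokens.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "佛山东方书城";
        // Prints [佛山, 山东, 东方, 方书, 书城]
        System.out.println(bigrams(doc));
        // Prints true: a sub-phrase matches without any wildcards.
        System.out.println(matchesPhrase(doc, "东方书城"));
        // Prints true: 山东 is only an artifact of 佛山 + 东方 being adjacent.
        System.out.println(matchesAny(doc, "山东"));
    }
}
```

The last call is the kind of false positive I mean: the bigram 山东 exists
in the index only because the last character of 佛山 and the first
character of 东方 happen to be neighbours. A dictionary-based analyzer
such as smartcn is meant to avoid that by splitting on real words.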


2010/7/1 Kolhoff, Jacqueline - ENCOWAY <Kolhoff@encoway.de>

>
> Hi!
>
> We are using lucene in our project to search through information objects
> which works fine. For indexing we use the StandardAnalyzer.
> Now, we have to support the Chinese language. I found out that the Chinese
> words and letters are correctly saved in the index but the query to search
> for them does not work. Example: in English language the query is “text”
> which we parse to “*text*”. If we search for Chinese words / phrases like
> “佛山东方书城” the query is “*佛山东方书城*” but there are no search results.
> If the query places blanks between the single letters / symbols like this
> “*佛 山 东 方 书 城*” we are getting results. Does the StandardAnalyzer
> interpret each Chinese letter as one word? What are best practices for this
> case? Shall we use another analyzer (Chinese analyzer)? Or is it better to
> replace the query parser in this case?
>
> Regards,
> Jacqueline.
>

--0016e64b03988ae102048a501e2b--