From: "Michael Siu"
Reply-To: java-user@lucene.apache.org
Subject: RE: How international languages are supported in Lucene
Date: Thu, 5 Jun 2008 10:20:30 -0700

Thanks Erick.

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Thursday, June 05, 2008 9:51 AM
To: java-user@lucene.apache.org
Subject: Re: How international languages are supported in Lucene

See below

On Thu, Jun 5, 2008 at 12:04 PM, Michael Siu wrote:

> Grant,
>
> Thanks for the timely reply. :-)
>
> No, we do not have a specific language in mind. Basically, our document
> source could potentially contain any language in the world. Supporting
> English, Spanish, Italian, French, Chinese, Russian and Japanese would be
> the minimum set.
>
> Do you mean we will need a different analyzer for each language? Does that
> mean we will need to know the language of a document before we can
> index it?
>

Yes and yes. Try searching the mail archives for things like multi-language
and you'll find this topic discussed ad nauseam.

But basically consider why this must be so, especially when stemming.
Languages are so variable that you'd get wildly different (and
inappropriate) results if you tried to analyze them with the same analyzer.
This is especially true when documents arrive in different character
encodings.

Best
Erick

>
> Thanks again.
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org]
> Sent: Thursday, June 05, 2008 8:53 AM
> To: java-user@lucene.apache.org
> Subject: Re: How international languages are supported in Lucene
>
> Hi Michael,
>
> That's a pretty open-ended question and, I'm assuming, by
> "international languages" you mean non-English :-). You might get
> some mileage out of
> http://wiki.apache.org/lucene-java/IndexingOtherLanguages
> but it is a bit out of date (namely the sandbox references).
> Lucene indexes non-English languages just like it does English. You
> need to figure out what Analyzer you need (have a look in the
> contrib/Analyzers code/javadocs for many existing languages) and then
> pretty much everything else is the same. Namely, the same principles
> apply (what to store, index, etc.) as they do in English.
>
> Did you have something specific in mind, i.e. how to handle Chinese
> or some specific language? Lastly, if you do have a language in mind,
> try searching the mail archives for the name of that language.
>
> HTH,
> Grant
>
> On Jun 5, 2008, at 11:32 AM, Michael Siu wrote:
>
> > Would someone tell me how Lucene supports indexing and searching
> > documents that contain international languages? What do I need to do
> > in addition to using the StandardAnalyzer?
> >
> > Thanks.
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
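
The "yes and yes" answer in this thread (one analyzer per language, and the language must be known before indexing) can be sketched as a small routing step. This is only a rough illustration, not something posted in the thread: the class AnalyzerRouting, the method analyzerFor, and the string labels are hypothetical placeholders. In a real indexer each label would stand for an actual Lucene Analyzer instance chosen from the contrib analyzers, and which ones are available depends on the Lucene version in use. The language code is assumed to come from document metadata or a separate detection step.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: chooses an analyzer label from a document's language code. */
public class AnalyzerRouting {

    /** Used when the language is unknown or has no dedicated analyzer. */
    static final String FALLBACK = "standard";

    // Illustrative labels only, keyed by ISO 639-1 code. Each label stands in
    // for whichever Analyzer instance the application actually configures.
    private static final Map<String, String> ANALYZER_BY_LANGUAGE = new HashMap<>();
    static {
        ANALYZER_BY_LANGUAGE.put("en", "english");
        ANALYZER_BY_LANGUAGE.put("es", "spanish");
        ANALYZER_BY_LANGUAGE.put("it", "italian");
        ANALYZER_BY_LANGUAGE.put("fr", "french");
        ANALYZER_BY_LANGUAGE.put("ru", "russian");
        // Chinese and Japanese share one label here purely for illustration.
        ANALYZER_BY_LANGUAGE.put("zh", "cjk");
        ANALYZER_BY_LANGUAGE.put("ja", "cjk");
    }

    /** Returns the analyzer label for a language code, or the fallback label. */
    public static String analyzerFor(String languageCode) {
        if (languageCode == null) {
            return FALLBACK;
        }
        String key = languageCode.trim().toLowerCase(Locale.ROOT);
        return ANALYZER_BY_LANGUAGE.getOrDefault(key, FALLBACK);
    }

    public static void main(String[] args) {
        // The language must be decided before the document is analyzed.
        for (String lang : new String[] {"en", "fr", "zh", "xx"}) {
            System.out.println(lang + " -> " + analyzerFor(lang));
        }
    }
}
```

Whatever choice is made at index time should be repeated at query time for the same language, otherwise the query terms and the indexed terms may not match.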