From: Daniel Noll
Organization: NUIX Pty Limited
Date: Wed, 26 Apr 2006 12:56:29 +1000
To: java-user@lucene.apache.org
Subject: Re: Indexing with SnowballAnalyzer and multiple languages in a single index

jwang@dicarta.com wrote:
> You can have multiple languages in the same index. Just make sure that
> your language identification process is consistent.
>
> You might still get some false positives, for example, if there's a
> German root that has the same letters as a French root, but means
> something different, then it might still show up. Personally, I don't
> really know how many times that actually happens.
>
> Lucene treats all _post-analyze_ tokens the same, it is pretty much
> language ignorant, so as long as the UTF characters are the same, it
> treats the tokens as the same.

I suppose one could work around that by prepending the language code to
every token. Then those two words won't match each other, while stemming
is preserved.

The real problem as I see it is when two languages have an *identical*
word, and the user types that in as their search query. Then you have to
wonder which language it's from... perhaps you would just expand this to
match multiple languages in the event of multiple matches.

Or perhaps you would just add a little drop-down to the place they enter
their query, where they can indicate what language the query is in.

Daniel

-- 
Daniel Noll
Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Ph: +61 2 9280 0699
Web: http://www.nuix.com.au/
Fax: +61 2 9212 6902
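
To make the language-prefix idea from the message concrete, here is a rough sketch in plain Java. It is not Lucene API: the class name, the ':' separator, and the per-language stemmer map are all assumptions made up for illustration. In a real index this logic would probably sit in a token filter that runs after the stemmer. The sketch assumes tokens arrive already stemmed, and that the query side stems the raw term once per candidate language before tagging it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Lucene: tags tokens with a language code
// so that identical strings from different languages stay distinct terms.
final class LanguageTaggedTokens {

    // Assumed separator; any character the analyzer never emits inside a
    // token would work equally well.
    static final char SEPARATOR = ':';

    private LanguageTaggedTokens() {}

    // Index side: prefix every already-stemmed token with its language code,
    // e.g. ("de", ["gift"]) becomes ["de:gift"].
    static List<String> tag(String languageCode, List<String> stemmedTokens) {
        List<String> tagged = new ArrayList<>(stemmedTokens.size());
        for (String token : stemmedTokens) {
            tagged.add(languageCode + SEPARATOR + token);
        }
        return tagged;
    }

    // Query side: when the query language is unknown, stem the raw term with
    // each language's stemmer and emit one tagged candidate per language.
    // The caller would OR these candidates together in the final query.
    static List<String> expand(String rawTerm,
                               Map<String, UnaryOperator<String>> stemmers) {
        List<String> candidates = new ArrayList<>(stemmers.size());
        for (Map.Entry<String, UnaryOperator<String>> entry : stemmers.entrySet()) {
            String stemmed = entry.getValue().apply(rawTerm);
            candidates.add(entry.getKey() + SEPARATOR + stemmed);
        }
        return candidates;
    }
}
```

With this scheme, German "Gift" (poison) indexed as "de:gift" no longer collides with English "gift" indexed as "en:gift". If the user picks a language from a drop-down, only that one tagged term is searched; otherwise expand() yields one candidate per language, which corresponds to the "match multiple languages" fallback mentioned above.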