From: Ivan Vasilev
Date: Thu, 22 Feb 2007 15:04:16 +0200
To: LUCENE MAIL LIST
Subject: Multy Language documents indexing

Hi All,

Our application uses Lucene for indexing, and it will need to index documents that each contain parts written in different languages.
For example, a document could contain English, Chinese and Brazilian Portuguese text. How should such a document be indexed? Is there a best practice for this?

What comes to mind is to index 3 different Lucene Documents for each real document and to keep, in a database, the meta information that these 3 Documents belong to the same real document. For example, for myDoc.doc the index would contain myDocEn.doc, myDocCn.doc and myDocBr.doc, and when the searched word is found in myDocCn.doc we would show myDoc.doc to the user.

The disadvantage is that the occurrences of the searched item would have to be recalculated relative to the original document. This matters for proximity queries like "Red NEAR/10 fox".

If someone knows a better practice than this, please let me know.

Thanks in advance,
Ivan
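
To make the idea above concrete, here is a minimal sketch of the bookkeeping it would need, in plain Java without any Lucene calls. The class `MultiLangDocMap`, its methods, and the `startOffset` field are hypothetical names introduced only for illustration. The sketch assumes each language part is one contiguous run of tokens in the original document, so that a position inside a part can be shifted back to a position in the original by adding the part's starting offset.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not part of Lucene): records which per-language
// sub-documents belong to which original document.
class MultiLangDocMap {

    // One language-specific part of an original document.
    static final class Part {
        final String parentId;  // original document, e.g. "myDoc.doc"
        final String lang;      // language tag, e.g. "en", "cn", "br"
        final int startOffset;  // assumed: token position in the parent where this part begins

        Part(String parentId, String lang, int startOffset) {
            this.parentId = parentId;
            this.lang = lang;
            this.startOffset = startOffset;
        }
    }

    private final Map<String, Part> partsBySubDocId = new LinkedHashMap<>();

    // Remember that subDocId is one language part of parentId.
    void register(String subDocId, String parentId, String lang, int startOffset) {
        partsBySubDocId.put(subDocId, new Part(parentId, lang, startOffset));
    }

    // Collapse hits on sub-documents into distinct original documents,
    // keeping the order in which the originals were first hit.
    List<String> parentsOf(List<String> hitSubDocIds) {
        Set<String> parents = new LinkedHashSet<>();
        for (String id : hitSubDocIds) {
            Part p = partsBySubDocId.get(id);
            if (p != null) {
                parents.add(p.parentId);
            }
        }
        return new ArrayList<>(parents);
    }

    // Translate a token position inside a sub-document into a position
    // in the original document (valid only under the contiguity assumption).
    int parentPosition(String subDocId, int positionInPart) {
        Part p = partsBySubDocId.get(subDocId);
        if (p == null) {
            throw new IllegalArgumentException("unknown sub-document: " + subDocId);
        }
        return p.startOffset + positionInPart;
    }
}
```

With myDocEn.doc, myDocCn.doc and myDocBr.doc all registered under myDoc.doc, hits on any of them collapse to a single result for myDoc.doc. Note that a proximity query whose two terms fall in different language parts would still not match inside any single sub-document, so this mapping only helps with reporting positions, not with matching across parts.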