Message-ID: <45DD94D0.8010901@sirma.bg>
Date: Thu, 22 Feb 2007 15:04:16 +0200
From: Ivan Vasilev <ivasilev@sirma.bg>
To: LUCENE MAIL LIST <java-user@lucene.apache.org>
Subject: Multi-language document indexing

Hi All,

Our application, which uses Lucene for indexing, will be used to index
documents that each contain parts written in different languages. For
example, a single document could contain English, Chinese and Brazilian
Portuguese text. How should such a document be indexed? Is there a best
practice for this?

What comes to mind is to index 3 separate Lucene Documents for each
real document and keep in a database the meta info that these 3
Documents belong to our real doc. For example, for myDoc.doc we would
have myDocEn.doc, myDocCn.doc and myDocBr.doc in the index, and when a
search finds the searched word in myDocCn.doc we would show myDoc.doc
to the user. The disadvantage is that the occurrences of the searched
item would then have to be recalculated against the real document. This
is important for queries like "Red NEAR/10 fox". So if someone knows a
better practice than this, please let me know.
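
To make the bookkeeping concrete, here is a minimal sketch of the
mapping I have in mind. It uses only the JDK, not Lucene: SubDoc,
split and toParentPosition are hypothetical names for illustration,
and the token lists stand in for whatever the per-language analyzers
would produce. Each sub-document remembers the token offset at which it
starts in the real document, so a hit position can be translated back.

```java
import java.util.ArrayList;
import java.util.List;

public class SubDocMapping {

    /** One per-language piece of a real document (hypothetical type, not a Lucene class). */
    static final class SubDoc {
        final String parentName;   // the real document, e.g. "myDoc.doc"
        final String subName;      // the indexed piece, e.g. "myDocCn.doc"
        final int startToken;      // token offset of this piece within the real document
        final List<String> tokens; // tokens of this piece only

        SubDoc(String parentName, String subName, int startToken, List<String> tokens) {
            this.parentName = parentName;
            this.subName = subName;
            this.startToken = startToken;
            this.tokens = tokens;
        }
    }

    /** Splits a real document into sub-documents and records where each piece starts. */
    static List<SubDoc> split(String parentName, String[] suffixes, List<List<String>> parts) {
        int dot = parentName.lastIndexOf('.');
        String base = parentName.substring(0, dot);
        String ext = parentName.substring(dot);
        List<SubDoc> result = new ArrayList<>();
        int offset = 0;
        for (int i = 0; i < parts.size(); i++) {
            result.add(new SubDoc(parentName, base + suffixes[i] + ext, offset, parts.get(i)));
            offset += parts.get(i).size();
        }
        return result;
    }

    /** Translates a token position inside a sub-document to a position in the real document. */
    static int toParentPosition(SubDoc sub, int positionInSub) {
        return sub.startToken + positionInSub;
    }

    public static void main(String[] args) {
        // Placeholder tokens; real ones would come from per-language analysis.
        List<List<String>> parts = List.of(
                List.of("the", "red", "fox"),
                List.of("hong", "huli"),
                List.of("a", "raposa", "vermelha"));
        List<SubDoc> subs = split("myDoc.doc", new String[] {"En", "Cn", "Br"}, parts);

        SubDoc cn = subs.get(1);
        int hit = cn.tokens.indexOf("huli");
        System.out.println(cn.subName + " -> " + cn.parentName
                + " at token " + toParentPosition(cn, hit));
    }
}
```

With the sample data, the hit at position 1 in myDocCn.doc maps to
token 4 of myDoc.doc. Note that this only relocates hits: a proximity
query whose terms fall in two different pieces would still not match
inside any single sub-document.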

Thanks in advance,
Ivan

