lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Multy Language documents indexing
Date Thu, 22 Feb 2007 13:11:28 GMT
I know this has been discussed several times, but I don't remember the
answers offhand. Search the mail archive for "multiple languages" and you'll
find some good suggestions. As I remember, though, it's not a trivial issue.

That said, I don't see why the "three different documents" approach wouldn't
work. You could also index the same text in three different fields of a
single document, using a different language analyzer for each field (see
PerFieldAnalyzerWrapper).
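
To make the single-document idea a bit more concrete, here is a rough sketch
in plain Java. It is not Lucene code: ToyAnalyzer, WORDS, BIGRAMS and the
field names body_en / body_zh are all made up for illustration, and real
language analyzers do far more (stemming, stop words, proper CJK handling).
It only mimics the routing that PerFieldAnalyzerWrapper provides, namely a
default analyzer plus per-field overrides, so the same text yields different
terms in each field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Toy model of per-field analysis; plain Java, NOT Lucene's API. */
public class PerFieldSketch {

    /** Hypothetical stand-in for a language analyzer: text in, terms out. */
    public interface ToyAnalyzer extends Function<String, List<String>> {}

    /** Assumed "Western" analyzer: lowercase, then split on non-letters. */
    public static final ToyAnalyzer WORDS = text -> {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    };

    /** Assumed "CJK" analyzer: overlapping character bigrams (BMP chars only). */
    public static final ToyAnalyzer BIGRAMS = text -> {
        List<String> terms = new ArrayList<>();
        String s = text.replaceAll("\\s+", "");
        for (int i = 0; i + 1 < s.length(); i++) {
            terms.add(s.substring(i, i + 2));
        }
        return terms;
    };

    private final ToyAnalyzer fallback;
    private final Map<String, ToyAnalyzer> perField = new HashMap<>();

    public PerFieldSketch(ToyAnalyzer fallback) {
        this.fallback = fallback;
    }

    /** Register an analyzer override for one field. */
    public void addAnalyzer(String field, ToyAnalyzer analyzer) {
        perField.put(field, analyzer);
    }

    /** Analyze text with the field's analyzer, or the fallback if none is set. */
    public List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, fallback).apply(text);
    }

    /** Index the same text once per field; each field gets its own terms. */
    public Map<String, List<String>> indexDocument(String text, List<String> fields) {
        Map<String, List<String>> termsByField = new LinkedHashMap<>();
        for (String field : fields) {
            termsByField.put(field, analyze(field, text));
        }
        return termsByField;
    }

    public static void main(String[] args) {
        PerFieldSketch wrapper = new PerFieldSketch(WORDS);
        wrapper.addAnalyzer("body_zh", BIGRAMS);
        // Same text, two fields, two different term streams.
        System.out.println(wrapper.indexDocument("Red fox", List.of("body_en", "body_zh")));
    }
}
```

At search time you would pick the field (or query all three) that matches the
language of the query. Because everything stays in one document, a hit in any
field already points at the real document, so no database mapping is needed.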

Erick

On 2/22/07, Ivan Vasilev <ivasilev@sirma.bg> wrote:
>
> Hi All,
>
> Our application uses Lucene for indexing, and it will be used to index
> documents, each of which contains parts written in different
> languages. For example, a document could contain English, Chinese and
> Brazilian Portuguese text. How should such a document be indexed? Is
> there a best practice for this?
>
> What comes to mind is to index three separate Lucene Documents for each
> real document and to keep meta info in a database recording that these
> three Documents belong to the same real doc. For example, for myDoc.doc
> the index would hold myDocEn.doc, myDocCn.doc and myDocBr.doc, and when
> a search finds the word in myDocCn.doc we would show the user
> myDoc.doc. The disadvantage is that the occurrences of the searched
> item would then have to be recalculated for the real document. This is
> important for queries like "Red NEAR/10 fox". So if someone knows a
> better practice than this, please let me know.
>
> Thanks in advance,
> Ivan
