lucene-java-user mailing list archives

From Paul Libbrecht
Subject Re: Indexing multiple languages
Date Wed, 01 Jun 2005 08:10:24 GMT
On 1 June 05, at 01:12, Erik Hatcher wrote:
>> 1/ one index for all languages
>> 2/ one index for all languages, with an extra language field so
>> searches can be constrained to a particular language
>> 3/ separate indices for each language?
> I would vote for option #2 as it gives the most flexibility - you can 
> query with or without concern for language.

The way I've solved this is to use a different field name per language, 
since our documents can be multilingual.
The query is then expanded at query time: given a term query on text, I 
duplicate it for each language the user accepts, with a boost factor 
derived from the user's preference for that language (e.g. the q value in 
the Accept-Language HTTP header). If my queries become too big, I could 
presumably fall back on solution 2/ instead, creating a separate document 
for each language of the original document.
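
To make the expansion concrete, here is a minimal sketch in plain Java. It is 
not the code from the setup described above: the class name, the method 
names and the "text_fr"-style field naming are assumptions made for 
illustration. It parses an Accept-Language value and emits one clause per 
accepted language in Lucene's query-parser syntax (field:term^boost). The 
term is assumed to be a single plain word with no query-syntax special 
characters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: expands one term into per-language field clauses.
final class LanguageQueryExpander {

    // Parses an Accept-Language value into primary language -> q weight,
    // keeping header order. A missing q defaults to 1.0; "*" and q=0 are dropped.
    static Map<String, Double> parseAcceptLanguage(String header) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (String part : header.split(",")) {
            String[] pieces = part.trim().split(";");
            String lang = pieces[0].trim().toLowerCase(Locale.ROOT);
            if (lang.isEmpty() || lang.equals("*")) {
                continue;
            }
            double q = 1.0;
            for (int i = 1; i < pieces.length; i++) {
                String param = pieces[i].trim();
                if (param.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(param.substring(2).trim());
                    } catch (NumberFormatException e) {
                        q = 0.0; // treat a malformed weight as "not accepted"
                    }
                }
            }
            // Reduce "en-US" to its primary subtag "en".
            int dash = lang.indexOf('-');
            if (dash > 0) {
                lang = lang.substring(0, dash);
            }
            if (q > 0.0) {
                // If a language appears twice, keep its highest weight.
                weights.merge(lang, q, Double::max);
            }
        }
        return weights;
    }

    // Builds e.g. "text_fr:dspace^1.0 text_en:dspace^0.8"; the
    // "<baseField>_<lang>" naming scheme is an assumption of this sketch.
    static String expand(String baseField, String term, String acceptLanguage) {
        List<String> clauses = new ArrayList<>();
        for (Map.Entry<String, Double> e : parseAcceptLanguage(acceptLanguage).entrySet()) {
            clauses.add(baseField + "_" + e.getKey() + ":" + term + "^" + e.getValue());
        }
        return String.join(" ", clauses);
    }
}
```

For the header value "fr, en;q=0.8, de;q=0.5" and the term "dspace", 
expand("text", "dspace", ...) returns 
"text_fr:dspace^1.0 text_en:dspace^0.8 text_de:dspace^0.5", which could then 
be handed to a query parser. Clauses separated by spaces are optional 
(OR-like) under the query parser's default operator, so a match in any 
accepted language counts, weighted by preference.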

I think it's very important to work out which languages the user 
accepts. Typically, Google's default behaviour is to return matches only 
in your primary language, and then let you widen the search to any 
language.

>> On the other hand, if people are searching for proper nouns in
>> metadata (e.g. "DSpace") it may be advantageous to search all
>> languages at once.

This one may need particular treatment.

Let us know how it goes!

