lucene-java-user mailing list archives

From heikki
Subject Re: Designing a multilingual index
Date Tue, 03 Jan 2012 12:56:41 GMT

I would like to have your opinions on the impact on relevance scoring in the
scenario where multiple languages are indexed in a single index.

>> Besides, IMO, scoring / ordering documents in different 
>> languages is a bit like comparing apples and oranges. 
> Not too sure about that. If for instance you were to search for "firewall"
> in a German/English index, the word may appear in documents in both
> languages. Lucene's ranking algorithm is based on the number of tokens and
> the number of occurrences in fields, and while of course the ratio may
> vary (e.g. German tends to collate several words into one single entity,
> resulting in one single token, while English rather uses phrases,
> resulting in several tokens), I still think, assuming the user can
> understand both, that it kind of makes sense to rank a short German
> document where "firewall" occurs 10 times higher than an English document
> where the same word occurs only 5 times. 

In our case, it is "known" in which language the user is searching (because
he tells us, and if he doesn't, we use the current GUI language). Results
are returned so that results in the requested language are ordered on top,
and within that, ordered by relevance. Results in other languages are also
returned, and presented after the requested-language results, ordered by
relevance.
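
As a rough illustration of that two-tier ordering (a sketch only: the hit
tuples and the `order_results` helper are made up for this example and are
not Lucene API, and it assumes the other-language hits are also sorted by
score):

```python
def order_results(hits, requested_lang):
    """Put hits in the requested language first, then sort by score.

    hits: list of (doc_id, language, score) tuples (hypothetical shape).
    """
    # False sorts before True, so requested-language hits come first;
    # within each group, the negated score gives descending relevance.
    return sorted(hits, key=lambda h: (h[1] != requested_lang, -h[2]))


hits = [("a", "de", 0.9), ("b", "en", 0.4), ("c", "en", 0.7), ("d", "de", 0.2)]
ordered_ids = [doc_id for doc_id, _, _ in order_results(hits, "en")]
print(ordered_ids)  # ['c', 'b', 'a', 'd']
```

Note that the German hit "a" stays below both English hits even though its
score is the highest overall.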

If the results in the requested language contain say one which has term A
and one which has term B, their positions in the relevance ranking (within
the requested-language results on top) can be influenced by occurrences of
terms A and B in the other languages, if a single search is used.
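
To make that concrete, here is a small sketch with invented document counts.
The `idf` function follows the formula used by Lucene's DefaultSimilarity,
1 + ln(numDocs / (docFreq + 1)); everything else (the counts, the `ranking`
helper) is hypothetical, and tf and length normalization are ignored.

```python
import math


def idf(doc_freq, num_docs):
    # Same shape as the idf in Lucene's DefaultSimilarity: 1 + ln(N / (df + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Invented counts: 10 English docs and 10 German docs.
EN_DOCS, DE_DOCS = 10, 10
en_df = {"A": 2, "B": 5}  # English docs containing each term
de_df = {"A": 9, "B": 0}  # German docs containing each term


def ranking(df, num_docs):
    """Order the terms by idf, highest first (each doc has tf = 1)."""
    return sorted(df, key=lambda term: idf(df[term], num_docs), reverse=True)


english_only = ranking(en_df, EN_DOCS)
combined_df = {term: en_df[term] + de_df[term] for term in en_df}
combined = ranking(combined_df, EN_DOCS + DE_DOCS)

print(english_only)  # ['A', 'B']
print(combined)      # ['B', 'A']
```

With English-only statistics the document containing A outranks the one
containing B; once the German documents share the index, the order flips,
although neither English document changed.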

I agree with the apples/oranges remark: if a term occurs in more than one
language, its document frequency, and hence its IDF, is likely to differ
per language, so to get the best relevance ranking there should be a
separate index for each language. The searches should then be truly
separate searches (no MultiSearcher, which would produce combined relevance
scores), and the results should be presented as several separate result
sets.
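
A minimal sketch of what "separate indexes, separate searches, separate
result sets" could look like, using toy in-memory indexes instead of Lucene.
The `search` and `search_per_language` helpers and the sample documents are
assumptions for illustration; scoring is reduced to sqrt(tf) * idf, with the
idf computed only from the statistics of the index being searched.

```python
import math


def idf(doc_freq, num_docs):
    # Same shape as the idf in Lucene's DefaultSimilarity: 1 + ln(N / (df + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def search(docs, term):
    """Rank one index (doc_id -> token list) using only its own statistics."""
    doc_freq = sum(1 for tokens in docs.values() if term in tokens)
    weight = idf(doc_freq, len(docs))
    scored = []
    for doc_id, tokens in docs.items():
        tf = tokens.count(term)
        if tf:
            scored.append((doc_id, math.sqrt(tf) * weight))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def search_per_language(indexes, term, requested):
    """Run one search per language; the requested language comes first."""
    langs = [requested] + sorted(l for l in indexes if l != requested)
    return [(lang, search(indexes[lang], term)) for lang in langs]


# Toy per-language indexes (invented sample data).
indexes = {
    "en": {
        "en1": ["firewall", "rules", "firewall"],
        "en2": ["network", "firewall"],
        "en3": ["coffee"],
    },
    "de": {"de1": ["firewall", "regeln"], "de2": ["kaffee"]},
}

result_sets = search_per_language(indexes, "firewall", "de")
for lang, hits in result_sets:
    print(lang, [doc_id for doc_id, _ in hits])
```

Scores from different result sets are never compared with each other here,
which is the point of keeping the searches separate.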

Does anyone have experience with this? Opinions? Is the improved relevance
per language worth the "hassle" of having separate indexes, doing separate
searches and presenting results per language? We already take care to use
appropriate stopwords and different analyzers when indexing and searching a
particular language, but that's obviously a different issue.

thanks in advance,

Heikki Doeleman
