lucene-java-user mailing list archives

From Paul Libbrecht
Subject Re: Designing a multilingual index
Date Tue, 03 Jan 2012 13:51:49 GMT

On 3 Jan 2012, at 13:56, heikki wrote:

> In our case, it is "known" in which language the user is searching (because
> he tells us, and if he doesn't, we use the current GUI language).

On the web such a signal is often hard to trust (e.g. people working in several languages,
internet cafés...), but that is your choice.

> Results
> are returned so that results in the requested language are ordered on top,
> and within that, ordered by relevance. Results in other languages are also
> returned, and presented after the requested-language results, ordered by
> relevance.

Would "shallow matches" in the right language come after "precise matches" in a wrong language?

> If the results in the requested language contain say one which has term A
> and one which has term B, their positions in the relevance ranking (within
> the requested-language results on top) can be influenced by occurrences of
> terms A and B in the other languages, if a single search is used.
> I agree to the apples/oranges remark: if a term occurs in more than one
> language, likely its IDF frequency is different for each language, so to
> have the best relevance ranking there should be separate indexes for each
> language. And searches should be really separate searches (no MultiSearcher
> which would produce combined relevance scores). So the results should also
> be presented as several, separate result sets.

I believe the right solution for this is simple: use different fields per language.

In both Solr and plain Lucene, using different fields allows a different analyzer per field,
which is what you want (e.g. a different stemmer per language).

Using different indexes is certainly a hassle; using different fields, not really.
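
To make the field-per-language idea concrete, here is a minimal library-free sketch in Python. It is not Lucene code: the field naming scheme (text_<lang>), the tiny stopword lists, and the helper names are all assumptions for illustration, and a real setup would attach a full analyzer (tokenizer, stopwords, stemmer) to each field.

```python
# Hypothetical per-language stopword lists; a real setup would plug in
# a complete analyzer for each language instead.
STOPWORDS = {
    "en": {"the", "of", "a"},
    "fr": {"le", "la", "de", "des"},
}


def analyze(text, lang):
    """Lowercase, split on whitespace, and drop that language's stopwords."""
    stop = STOPWORDS.get(lang, set())
    return [tok for tok in text.lower().split() if tok not in stop]


def to_document(doc_id, text, lang):
    """Store the text under a field named after its language, e.g. text_fr."""
    return {"id": doc_id, "text_" + lang: analyze(text, lang)}


doc = to_document("42", "La carte des sols", "fr")
print(doc)
```

All documents live in one index, but each language's text only ever passes through its own analysis chain, because the field name selects the analyzer.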

The important bit is to use query expansion.
Given the user's query (with or without parameters, with text queries), expand it into a query where
the "normal text" is expected to be in the right language but may also match in one of the other
languages (those the browser announces and your platform supports), with less weight of course.

In my case, query expansion is done by post-processing the output of the query parser.
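
As a rough illustration of such an expansion, the sketch below builds a query string in Lucene's classic query syntax (field:(terms)^boost). This is a simplification: it assembles a string instead of post-processing a parsed query object, and the text_<lang> field names, the 0.3 boost, and the function name are assumptions, not values from any real setup.

```python
def expand_query(terms, primary_lang, other_langs, other_boost=0.3):
    """Expand terms into one clause per language field.

    The primary language gets full weight; the other languages get a
    lower, illustrative boost. Terms are assumed to be already escaped.
    """
    def clause(lang, boost):
        return "text_%s:(%s)^%s" % (lang, " ".join(terms), boost)

    clauses = [clause(primary_lang, 1.0)]
    clauses += [
        clause(lang, other_boost)
        for lang in other_langs
        if lang != primary_lang  # never add the primary language twice
    ]
    return " OR ".join(clauses)


print(expand_query(["soil", "map"], "en", ["fr", "nl"]))
# prints: text_en:(soil map)^1.0 OR text_fr:(soil map)^0.3 OR text_nl:(soil map)^0.3
```

With this shape, a document matching only in a secondary language can still be returned, but it contributes less to the score than a match in the requested language.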

You can then also distinguish fields by how precise the match is: one field with exact
matches (using the whitespace tokenizer), one with stemmed matches (e.g. using the Porter
family of stemmers), and one with phonetic matches.
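
The three levels of precision can be sketched as follows, again in plain Python with no Lucene involved. The stemmer and phonetic key here are deliberately crude stand-ins (not the Porter algorithm, not Soundex or Metaphone), and the field prefixes exact_, stem_ and phon_ are hypothetical names.

```python
def exact_tokens(text):
    # Whitespace split only, like a whitespace tokenizer:
    # case and punctuation are preserved.
    return text.split()


def crude_stem(token):
    # Toy suffix stripper standing in for a real stemmer.
    token = token.lower()
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def crude_phonetic(token):
    # Toy phonetic key: keep the first letter, drop later vowels,
    # and skip a letter equal to the last one kept.
    token = token.lower()
    key = token[:1]
    for ch in token[1:]:
        if ch in "aeiouy" or ch == key[-1]:
            continue
        key += ch
    return key


def field_variants(text, lang):
    """Produce the exact, stemmed and phonetic field contents for one text."""
    tokens = exact_tokens(text)
    return {
        "exact_" + lang: tokens,
        "stem_" + lang: [crude_stem(t) for t in tokens],
        "phon_" + lang: [crude_phonetic(t) for t in tokens],
    }


fields = field_variants("Mapping soils", "en")
print(fields)
```

At query time the same three fields would be searched with decreasing weights, so an exact hit outranks a stemmed one, which in turn outranks a purely phonetic one.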

Hope it helps.


> Does anyone have experience with this ? Opinions ? Is the improved relevance
> per language worth the "hassle" of having separate indexes, doing separate
> searches and presenting results per language ? We do already take care of
> using appropriate stopwords/different analyzers when indexing and searching a
> particular language, but that's a different issue obviously.
