lucene-java-user mailing list archives

From Paul Libbrecht
Subject Re: Designing a multilingual index
Date Tue, 03 Jan 2012 14:29:11 GMT

It does solve your main concern: a term in Lucene is a pair of a field name and a token.
The term frequency is thus the frequency of a token in a field.

So the term-frequency of text-stemmed-de:firewall is independent of the term-frequency of
text-stemmed-en:firewall (for example).
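
To make this concrete, here is a small sketch in plain Java. It is a toy model and not the Lucene API: the class FieldTermStats and its methods are made up for illustration. It counts, per (field, token) pair, the number of documents containing that term, which is the statistic IDF is derived from.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical toy model, not a Lucene class.
public class FieldTermStats {
    // Number of documents containing each term, where a term is the pair
    // (field, token), encoded here as the string "field:token".
    private final Map<String, Integer> docFreq = new HashMap<>();

    // Index one document's tokens for one field. Each distinct token
    // counts once per document, however often it repeats.
    public void addDocument(String field, List<String> tokens) {
        for (String token : new HashSet<>(tokens)) {
            docFreq.merge(field + ":" + token, 1, Integer::sum);
        }
    }

    // Document frequency of the term (field, token); 0 if never seen.
    public int docFreq(String field, String token) {
        return docFreq.getOrDefault(field + ":" + token, 0);
    }
}
```

Adding two documents with "firewall" in text-stemmed-en and one with "firewall" in text-stemmed-de gives a count of 2 for the first term and 1 for the second: the two counts never mix, even though the token is the same.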

But with the query-expansion mechanism, both term queries will likely be present
and both will contribute to the score, which I think is correct.
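
As a rough sketch of what I mean by expansion (plain Java again, not Lucene classes; the class QueryExpansion, the method expand and the boost values are assumptions for illustration only): one user token becomes one weighted term per language field, with the requested language at full weight and the others lower.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Lucene class.
public class QueryExpansion {
    // Expand a single token into weighted "field:token" terms.
    // The field naming "text-stemmed-<lang>" follows the examples above;
    // otherBoost is an illustrative weight for the non-preferred languages.
    public static Map<String, Float> expand(String token,
                                            String preferredLang,
                                            List<String> otherLangs,
                                            float otherBoost) {
        // LinkedHashMap keeps the preferred language first.
        Map<String, Float> clauses = new LinkedHashMap<>();
        clauses.put("text-stemmed-" + preferredLang + ":" + token, 1.0f);
        for (String lang : otherLangs) {
            if (!lang.equals(preferredLang)) {
                clauses.put("text-stemmed-" + lang + ":" + token, otherBoost);
            }
        }
        return clauses;
    }
}
```

In a real index each entry would become one optional term query with its boost, so a match in any language can contribute, each scored with the statistics of its own field.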


On 3 Jan 2012, at 15:06, heikki wrote:
>> The important bit is to use query-expansion.
>> Given a query of the user (with params or not, with text-queries), expand
>> it to a query where the "normal text" is expected to be in the right
>> language, but maybe also in one of the other languages (that
>> the browser says, that your platform supports), with less weight of
>> course.
> We do something like that now in a single-index solution: results in the
> requested language are boosted enough that they're always on top.
> I don't think though that this addresses what is my main point: the
> frequency of terms in different domains (in this case, different languages)
> is different for each domain. This means that if the domains are chunked
> together in one index, the IDF value for a term is less "accurate" than if
> multiple, separate indexes were used. A term is more or less frequent in
> one domain or another, for a reason. Relevance ranking is impacted by
> that, and is more accurate if separate indexes are used -- I think this
> seems logical.
> I just don't know how much impact it really has, and whether it is worth
> dealing with it by presenting separate result sets from separate index
> searches.
