lucene-java-user mailing list archives

From Paul Libbrecht <>
Subject Re: Designing a multilingual index
Date Tue, 03 Jan 2012 15:10:59 GMT
I think the IDF is also computed per term (token plus field name), not per bare token.
Maybe an expert can confirm this, or we will have to devise a test.
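
A minimal sketch of such a test, written as a toy in-memory model rather than against Lucene itself. The class `FieldIdfSketch`, its methods, and the sample documents are all hypothetical; the IDF formula `1 + ln(numDocs / (docFreq + 1))` is the classic TF-IDF form and is assumed here, not confirmed for any particular Lucene version. The point is only to show what "document frequency keyed by (field, token)" would mean for the two `firewall` terms.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model (NOT Lucene): document frequency is tracked per (field, token) pair. */
class FieldIdfSketch {
    // Key "field:token" -> ids of the documents that contain that term.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // Ids of all documents in the (single, shared) index.
    private final Set<Integer> docIds = new HashSet<>();

    void add(int docId, String field, String... tokens) {
        docIds.add(docId);
        for (String token : tokens) {
            postings.computeIfAbsent(field + ":" + token, k -> new HashSet<>()).add(docId);
        }
    }

    /** Number of documents containing the token in the given field. */
    int docFreq(String field, String token) {
        Set<Integer> docs = postings.get(field + ":" + token);
        return docs == null ? 0 : docs.size();
    }

    /** Assumed classic form: 1 + ln(numDocs / (docFreq + 1)); numDocs is index-wide. */
    double idf(String field, String token) {
        return 1.0 + Math.log(docIds.size() / (double) (docFreq(field, token) + 1));
    }

    /** Hypothetical sample data: three English documents, two German ones. */
    static FieldIdfSketch sample() {
        FieldIdfSketch index = new FieldIdfSketch();
        index.add(0, "text-stemmed-en", "firewall", "rule");
        index.add(1, "text-stemmed-en", "firewall", "port");
        index.add(2, "text-stemmed-en", "firewall");
        index.add(3, "text-stemmed-de", "firewall", "regel");
        index.add(4, "text-stemmed-de", "netzwerk");
        return index;
    }

    public static void main(String[] args) {
        FieldIdfSketch index = sample();
        for (String field : new String[] {"text-stemmed-en", "text-stemmed-de"}) {
            System.out.println(field + ":firewall df=" + index.docFreq(field, "firewall")
                    + " idf=" + index.idf(field, "firewall"));
        }
    }
}
```

In this model the two document frequencies are independent, but the document count in the numerator is still shared by both languages, so a per-field IDF would not make a single index behave exactly like separate indexes. Whether that difference matters in practice is the open question of the thread.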


On 3 Jan 2012, at 15:43, heikki wrote:

> hi Paul,
> yes, but my concern isn't the term frequency but rather the inverse
> document frequency, which is also used in the relevance score and which
> takes into account all documents in the index. In this way the
> relevance score of one document is influenced by the contents of all other
> documents in the same index. This is why it seems logical to me
> that if different domains use separate indexes, the relevance scoring is
> more accurate.
> Kind regards,
> Heikki Doeleman
> On Tue, Jan 3, 2012 at 3:29 PM, Paul Libbrecht <> wrote:
>> Heikki,
>> it does solve your main concern: a term in Lucene is a pair of a token and
>> a field name.
>> The term frequency is thus the frequency of a token in a field.
>> So the term frequency of text-stemmed-de:firewall is independent of the
>> term frequency of text-stemmed-en:firewall (for example).
>> But with the query expansion mechanism, it is likely that both
>> term queries will be present and both will contribute to the score, which
>> is correct, I think.
>> paul
>> On 3 Jan 2012, at 15:06, heikki wrote:
>>>> The important bit is to use query expansion.
>>>> Given the user's query (with or without params, with text queries),
>>>> expand it to a query where the "normal text" is expected to be in the
>>>> right language, but maybe also in one of the other languages (those
>>>> the browser announces and your platform supports), with less weight
>>>> of course.
>>> We do something like that now in a single-index solution: results in the
>>> requested language are boosted enough that they are always on top.
>>> I don't think, though, that this addresses my main point: the
>>> frequency of terms differs between domains (in this case, between
>>> languages). This means that if the domains are lumped together in one
>>> index, the IDF value for a term is less "accurate" than if multiple,
>>> separate indexes were used. A term is more or less frequent in one
>>> domain or another for a reason. Relevance ranking is affected by that,
>>> and is more accurate if separate indexes are used. I think this seems
>>> logical.
>>> I just don't know how much impact it really has, and whether it is worth
>>> dealing with it by presenting separate result sets from separate index
>>> searches.
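
The query expansion discussed in the quoted thread could be sketched roughly as follows. This is plain Java with no Lucene dependency: the class `QueryExpansionSketch`, the boost values, and the `text-stemmed-<lang>` field naming (borrowed from the example field names in the thread) are assumptions for illustration. The `field:token^boost` strings only mimic query-syntax notation and are not claimed to be parseable by any particular query parser as written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: expand one user token into per-language clauses with different weights. */
class QueryExpansionSketch {
    // Hypothetical weights: the requested language counts more than the others.
    static final double PREFERRED_BOOST = 2.0;
    static final double FALLBACK_BOOST = 0.5;

    /** Returns one clause for the preferred language, then one per other language. */
    static List<String> expand(String token, String preferredLang, List<String> otherLangs) {
        List<String> clauses = new ArrayList<>();
        clauses.add(clause(preferredLang, token, PREFERRED_BOOST));
        for (String lang : otherLangs) {
            if (!lang.equals(preferredLang)) { // avoid a duplicate clause
                clauses.add(clause(lang, token, FALLBACK_BOOST));
            }
        }
        return clauses;
    }

    // Renders "text-stemmed-<lang>:<token>^<boost>" for display purposes only.
    private static String clause(String lang, String token, double boost) {
        return "text-stemmed-" + lang + ":" + token + "^" + boost;
    }

    public static void main(String[] args) {
        List<String> clauses = expand("firewall", "en", Arrays.asList("de", "fr"));
        System.out.println(String.join(" ", clauses));
    }
}
```

Each clause targets a different field, so each would be scored against its own per-field statistics, with the boost deciding how much the non-preferred languages contribute.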
