lucene-java-user mailing list archives

From Bob Carpenter
Subject Re: a "fair" similarity
Date Tue, 21 Nov 2006 22:16:25 GMT
Michael D. Curtin wrote:
> Daniel Naber wrote:
>> Hi,
>> as some of you may have noticed, Lucene prefers shorter documents over 
>> longer ones, i.e. shorter documents get a higher ranking, even if the 
>> ratio "matched terms / total terms in document" is the same.

There are even more interesting kinds of "unfairness".

Suppose we have a document.  We can turn the
document into a query in the obvious way (a set
of boolean SHOULD clauses with term frequencies
given by counts in the doc).

Lucene's IDF scaling is only applied to the query.
This is great for performance, because the doc vectors
remain stable as new docs are added.

Then, in general:

score(doc,doc) < score(doc,doc')

if IDF(doc) = doc'.  That is, the inversely IDF-scaled
query matches a document better than the document itself.
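
Here is a toy sketch of the effect in Python.  It is not
Lucene's actual scoring formula: it assumes a plain cosine
score in which IDF weights are applied to the query vector
only, and it reads the first argument of score() as the
query.  The helper names (cosine, score), the three-term
vocabulary, and the IDF values are all made up for
illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term -> weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

def score(query_tf, doc_tf, idf):
    """Toy score: IDF scales the query side only; the doc stays raw tf."""
    weighted_query = {t: tf * idf[t] for t, tf in query_tf.items()}
    return cosine(weighted_query, doc_tf)

# Hypothetical IDF values: a common word and two rarer ones.
idf = {"the": 0.1, "lucene": 2.0, "similarity": 3.0}

# Term counts of the original document.
doc = {"the": 10, "lucene": 2, "similarity": 1}

# doc' = IDF(doc): a document whose term weights are the
# IDF-scaled counts of doc.
doc_prime = {t: tf * idf[t] for t, tf in doc.items()}

self_score = score(doc, doc, idf)         # doc used as a query against itself
prime_score = score(doc, doc_prime, idf)  # same query against doc'

print(self_score, prime_score)
```

Under these assumptions the IDF-weighted query vector is
exactly doc', so the cosine against doc' reaches its maximum,
while the cosine against doc itself falls short whenever the
IDF values are not all equal.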

- Bob Carpenter
