lucene-java-user mailing list archives

From "Michael D. Curtin" <m...@curtin.com>
Subject Re: a "fair" similarity
Date Tue, 15 Aug 2006 01:26:40 GMT
Daniel Naber wrote:
> Hi,
> 
> as some of you may have noticed, Lucene prefers shorter documents over 
> longer ones, i.e. shorter documents get a higher ranking, even if the 
> ratio "matched terms / total terms in document" is the same.
> 
> For example, take these two artificial documents:
> 
> doc1: x 2 3 4 5 6 7 8 9 10
> doc2: x x 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> 
> When searching for "x" doc1 will get a higher ranking, even though "x" 
> makes up 1/10 of the terms in both documents.

I think it depends on what you want "similar" to mean.  The preference for 
shorter docs comes from the "parsimony" concept, if I remember my Information 
Theory correctly: the less data it takes to get to a given result (1/10 "x" 
in your example), the better.  It sounds like you want doc1 and doc2 to be 
considered equally similar, at least for "x".  Would you want doc3 below to be 
treated the same way?

doc3: x  2  3  4  5  6  7  8  9 10
       x 12 13 14 15 16 17 18 19 20
       x 22 ...                  30
       x 32 ...                  40
                             ... 1000

In some situations, the appearance of "x" is more significant in doc1, because 
the doc contains hardly anything else in the first place.  I think that 
situation tends to be more common in English prose, which may be why it's the 
default in Lucene.

I think your proposed formula would treat all docs, 1-3, the same.  If that's 
what you want, I'd say you're good to go.
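
As a rough illustration of the ratio discussed above ("matched terms / total 
terms in document"), here is a minimal Java sketch.  It does not use Lucene's 
API or reproduce its actual scoring; the class and method names (FairRatio, 
ratio, periodicDoc) are hypothetical, and doc3 is assumed to contain "x" at 
every tenth position out of 1000 terms, as the layout above suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FairRatio {

    // Fraction of tokens in the document that equal the query term.
    static double ratio(List<String> doc, String term) {
        int matches = 0;
        for (String token : doc) {
            if (token.equals(term)) {
                matches++;
            }
        }
        return (double) matches / doc.size();
    }

    // Hypothetical helper: builds a document of `length` tokens in which
    // positions 1, 1 + period, 1 + 2 * period, ... hold `term` and every
    // other position holds its own number (an assumed reading of doc3).
    static List<String> periodicDoc(String term, int length, int period) {
        List<String> doc = new ArrayList<>();
        for (int pos = 1; pos <= length; pos++) {
            doc.add(pos % period == 1 ? term : Integer.toString(pos));
        }
        return doc;
    }

    public static void main(String[] args) {
        List<String> doc1 = Arrays.asList("x 2 3 4 5 6 7 8 9 10".split(" "));
        List<String> doc2 = Arrays.asList(
            "x x 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20".split(" "));
        List<String> doc3 = periodicDoc("x", 1000, 10);

        // A purely ratio-based score cannot tell the three documents apart.
        System.out.println("doc1: " + ratio(doc1, "x"));
        System.out.println("doc2: " + ratio(doc2, "x"));
        System.out.println("doc3: " + ratio(doc3, "x"));
    }
}
```

Under these assumptions all three ratios come out the same, which is the 
sense in which a purely ratio-based score would treat docs 1-3 alike.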

--MDC
