lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ken McCracken <ken.mccrac...@gmail.com>
Subject Similarity score computation documentation
Date Tue, 14 Sep 2004 00:20:32 GMT
Hi,

I was looking through the score computation when running search, and
think there may be a discrepancy between what is _documented_ in the
org.apache.lucene.search.Similarity class overview Javadocs, and what
actually occurs in the code.

I believe the problem is only with the documentation.

I'm pretty sure that there should be an idf^2 in the sum.  Look at
org.apache.lucene.search.TermQuery, the inner class TermWeight.  You
can see that first sumOfSquaredWeights() is called, followed by
normalize(), during search.  Further, the resulting value stored in
the field "value" is set as the "weightValue" on the TermScorer.

If we look at what happens to TermWeight, sumOfSquaredWeights() sets
"queryWeight" to idf * boost.  During normalize(), "queryWeight" is
multiplied by the query norm, and "value" is set to queryWeight * idf
== idf * boost * query norm * idf == idf^2 * boost * query norm.  This
becomes the "weightValue" in the TermScorer that is then used to
multiply with the appropriate tf, etc., values.

The remaining terms in the Similarity description are properly
appended.  I also see that the queryNorm effectively "cancels out"
(dimensionally, since it is a 1/ square root of a sum of squares of
idfs) one of the idfs, so the formula still ends up being roughly a
TF-IDF formula.  But the idf^2 should still be there, along with the
expansion of queryNorm.

Am I mistaken, or is the documentation off?

Thanks for your help,
-Ken

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message