lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Similarity scores: tf(), lengthNorm(), sumOfSquaredWeights().
Date Mon, 20 Sep 2004 18:52:09 GMT

After last week's discussion on idf() of the similarity score computation
I looked into the score computation a bit deeper.

In the DefaultSimilarity tf() is the sqrt() and lengthNorm() is the inverse
of sqrt(). That means that the factor (docTf * docNorm) actually
implements the square root of the density of the query term in the
document field (ignoring the encoding and decoding of the norm).

Summing these weighted square roots resembles a 
Salton OR p-Norm for p = 1/2, except that Salton defined
the p-Norm's for p >= 1, and the result is more like an AND
p-Norm because it depends mostly on the minimum argument.

The pnorm also requires that the sum is taken to the power 1/p,
but this is not necessary as it would not change the ranking.

I looked around for p-Norm's with 0<p<1, but I didn't find
anything. Is there really nothing about this? A good discussion is here:
http://elvis.slis.indiana.edu/irpub/SIGIR/1994/cite19.htm

I would guess that since the sqrt() has an infinite derivative at zero, it
might well be that this OR p-Norm for p = 1/2 behaves much like a
rather high power AND p-Norm.

The basic summing form of the OR p-Norm also allows a very easy
implementation by just summing the weighted square roots; an AND
p-Norm for p >= 1 would have needed some more calculations.
Is this perhaps one of the reasons for using a power p  < 1 ?

Taking this a bit further, I also wonder about the name of
sumOfSquaredWeights() in the Weight interface.
Shouldn't that rather be  sumOfPowerWeights() and 
by default implement a sum of square roots?
This would allow a more straightforward comprehension
of the of the term weights as directly weighing the term densities.

Section 5 of the reference above has the full weighted
p-Norm formula's. The OR p-Norm there is very close
to the Lucene formula without coord().

Regards,
Paul Elschot

On Tuesday 14 September 2004 23:49, Doug Cutting wrote:
> Your analysis sounds correct.
>
> At base, a weight is a normalized tf*idf.  So a document weight is:
>
>    docTf * idf * docNorm
>
> and a query weight is:
>
>    queryTf * idf * queryNorm
>
> where queryTf is always one.
>
> So the product of these is (docTf * idf * docNorm) * (idf * queryNorm),
> which indeed contains idf twice.  I think the best documentation fix
> would be to add another idf(t) clause at the end of the formula, next to
> queryNorm(q), so this is clear.  Does that sound right to you?
>
> Doug
>
> Ken McCracken wrote:
> > Hi,
> >
> > I was looking through the score computation when running search, and
> > think there may be a discrepancy between what is _documented_ in the
> > org.apache.lucene.search.Similarity class overview Javadocs, and what
> > actually occurs in the code.
> >
> > I believe the problem is only with the documentation.
> >
> > I'm pretty sure that there should be an idf^2 in the sum.  Look at
> > org.apache.lucene.search.TermQuery, the inner class TermWeight.  You
> > can see that first sumOfSquaredWeights() is called, followed by
> > normalize(), during search.  Further, the resulting value stored in
> > the field "value" is set as the "weightValue" on the TermScorer.
> >
> > If we look at what happens to TermWeight, sumOfSquaredWeights() sets
> > "queryWeight" to idf * boost.  During normalize(), "queryWeight" is
> > multiplied by the query norm, and "value" is set to queryWeight * idf
> > == idf * boost * query norm * idf == idf^2 * boost * query norm.  This
> > becomes the "weightValue" in the TermScorer that is then used to
> > multiply with the appropriate tf, etc., values.
> >
> > The remaining terms in the Similarity description are properly
> > appended.  I also see that the queryNorm effectively "cancels out"
> > (dimensionally, since it is a 1/ square root of a sum of squares of
> > idfs) one of the idfs, so the formula still ends up being roughly a
> > TF-IDF formula.  But the idf^2 should still be there, along with the
> > expansion of queryNorm.
> >
> > Am I mistaken, or is the documentation off?
> >
> > Thanks for your help,
> > -Ken
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message