After last week's discussion of idf() in the similarity score computation,
I looked into the score computation a bit more deeply.
In DefaultSimilarity, tf() is the square root of the term frequency and
lengthNorm() is the inverse square root of the number of terms in the
field. That means that the factor (docTf * docNorm) actually
implements the square root of the density of the query term in the
document field (ignoring the encoding and decoding of the norm).
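
To make the density claim concrete, here is a minimal sketch. The
class name SqrtDensityDemo and the example numbers are made up for
illustration, the helpers only mirror the two formulas described above
rather than reproduce the Lucene class, and the byte encoding of the
norm is ignored:

```java
// Sketch only: these helpers mirror the formulas described above;
// they are not the Lucene DefaultSimilarity class itself.
class SqrtDensityDemo {
    // tf(): square root of the term frequency in the field.
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // lengthNorm(): inverse square root of the number of terms.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Density: fraction of the field's terms that are the query term.
    static double density(double freq, int numTerms) {
        return freq / numTerms;
    }

    public static void main(String[] args) {
        double freq = 4.0;   // hypothetical: term occurs 4 times
        int numTerms = 100;  // hypothetical: field has 100 terms
        double product = tf(freq) * lengthNorm(numTerms);
        double sqrtDensity = Math.sqrt(density(freq, numTerms));
        // The two values agree up to floating point rounding.
        System.out.println(product + " vs " + sqrtDensity);
    }
}
```

The identity is just sqrt(freq) / sqrt(numTerms) = sqrt(freq / numTerms),
so it holds for any positive field length.
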
Summing these weighted square roots resembles a
Salton OR pNorm for p = 1/2, except that Salton defined
the pNorms for p >= 1, and the result is more like an AND
pNorm because it depends mostly on the minimum argument.
The pNorm also requires that the sum be taken to the power 1/p,
but this is not necessary here as it would not change the ranking.
I looked around for pNorms with 0 < p < 1, but I didn't find
anything. Is there really nothing about this? A good discussion of
pNorms is here:
http://elvis.slis.indiana.edu/irpub/SIGIR/1994/cite19.htm
I would guess that since the sqrt() has an infinite derivative at zero, it
might well be that this OR pNorm for p = 1/2 behaves much like a
rather high power AND pNorm.
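
A small numeric experiment illustrates the AND-like behaviour. This
is only a sketch under assumptions: it uses a plain unweighted power
mean as a stand-in for the OR pNorm form, and the class name
PowerMeanDemo and the two example documents are made up. Both
documents have the same total density over two query terms, but one
spreads it over both terms and the other puts it all on one term:

```java
// Sketch only: an unweighted power mean used as a stand-in for the
// OR pNorm form; names and numbers are illustrative assumptions.
class PowerMeanDemo {
    // ((x1^p + ... + xn^p) / n)^(1/p), for x >= 0 and p > 0.
    static double powerMean(double[] xs, double p) {
        double sum = 0.0;
        for (double x : xs) {
            sum += Math.pow(x, p);
        }
        return Math.pow(sum / xs.length, 1.0 / p);
    }

    // The shortcut: sum the square roots and skip the outer power.
    static double sumOfSqrt(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += Math.sqrt(x);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] balanced = {0.02, 0.02};     // both terms present
        double[] concentrated = {0.04, 0.0};  // one term missing
        for (double p : new double[] {0.5, 1.0, 2.0}) {
            System.out.println("p=" + p
                + " balanced=" + powerMean(balanced, p)
                + " concentrated=" + powerMean(concentrated, p));
        }
        System.out.println("sumOfSqrt balanced=" + sumOfSqrt(balanced)
            + " concentrated=" + sumOfSqrt(concentrated));
    }
}
```

For p = 1/2 the balanced document scores higher, for p = 1 the two
tie, and for p = 2 the concentrated document scores higher. The plain
sum of square roots orders the two documents the same way as the
p = 1/2 power mean, since dividing by the number of terms and raising
a non-negative value to the power 1/p are both monotonic.
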
The basic summing form of the OR pNorm also allows a very easy
implementation by just summing the weighted square roots; an AND
pNorm for p >= 1 would have needed some more calculations.
Is this perhaps one of the reasons for using a power p < 1?
Taking this a bit further, I also wonder about the name of
sumOfSquaredWeights() in the Weight interface.
Shouldn't that rather be sumOfPowerWeights() and
by default implement a sum of square roots?
This would allow a more straightforward comprehension
of the term weights as directly weighing the term densities.
Section 5 of the reference above has the full weighted
pNorm formulas. The OR pNorm there is very close
to the Lucene formula without coord().
Regards,
Paul Elschot
On Tuesday 14 September 2004 23:49, Doug Cutting wrote:
> Your analysis sounds correct.
>
> At base, a weight is a normalized tf*idf. So a document weight is:
>
> docTf * idf * docNorm
>
> and a query weight is:
>
> queryTf * idf * queryNorm
>
> where queryTf is always one.
>
> So the product of these is (docTf * idf * docNorm) * (idf * queryNorm),
> which indeed contains idf twice. I think the best documentation fix
> would be to add another idf(t) clause at the end of the formula, next to
> queryNorm(q), so this is clear. Does that sound right to you?
>
> Doug
>
> Ken McCracken wrote:
> > Hi,
> >
> > I was looking through the score computation when running search, and
> > think there may be a discrepancy between what is _documented_ in the
> > org.apache.lucene.search.Similarity class overview Javadocs, and what
> > actually occurs in the code.
> >
> > I believe the problem is only with the documentation.
> >
> > I'm pretty sure that there should be an idf^2 in the sum. Look at
> > org.apache.lucene.search.TermQuery, the inner class TermWeight. You
> > can see that first sumOfSquaredWeights() is called, followed by
> > normalize(), during search. Further, the resulting value stored in
> > the field "value" is set as the "weightValue" on the TermScorer.
> >
> > If we look at what happens to TermWeight, sumOfSquaredWeights() sets
> > "queryWeight" to idf * boost. During normalize(), "queryWeight" is
> > multiplied by the query norm, and "value" is set to queryWeight * idf
> > == idf * boost * query norm * idf == idf^2 * boost * query norm. This
> > becomes the "weightValue" in the TermScorer that is then used to
> > multiply with the appropriate tf, etc., values.
> >
> > The remaining terms in the Similarity description are properly
> > appended. I also see that the queryNorm effectively "cancels out"
> > (dimensionally, since it is a 1/ square root of a sum of squares of
> > idfs) one of the idfs, so the formula still ends up being roughly a
> > TFIDF formula. But the idf^2 should still be there, along with the
> > expansion of queryNorm.
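
The sequence described above can be sketched in a few lines. This is
a simplified stand-in and not the TermQuery source: the class name
TermWeightSketch and the example idf and boost values are made up,
doubles replace the float arithmetic, and queryNorm is assumed to be
the inverse square root of the sum of squared weights, as described
above:

```java
// Sketch only: mimics the order of calls described above for a
// single term; it is not the real TermQuery.TermWeight class.
class TermWeightSketch {
    final double idf;
    final double boost;
    double queryWeight;
    double value;

    TermWeightSketch(double idf, double boost) {
        this.idf = idf;
        this.boost = boost;
    }

    // Step 1: queryWeight = idf * boost, and report its square.
    double sumOfSquaredWeights() {
        queryWeight = idf * boost;
        return queryWeight * queryWeight;
    }

    // Assumed form: 1 / sqrt(sum of squared weights).
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    // Step 2: scale by queryNorm, then multiply by idf again.
    void normalize(double queryNorm) {
        queryWeight *= queryNorm;
        value = queryWeight * idf;  // idf^2 * boost * queryNorm
    }

    public static void main(String[] args) {
        TermWeightSketch w = new TermWeightSketch(3.0, 2.0);
        double norm = queryNorm(w.sumOfSquaredWeights());
        w.normalize(norm);
        System.out.println("value=" + w.value
            + " idf^2*boost*queryNorm=" + (3.0 * 3.0 * 2.0 * norm));
    }
}
```

For a single-term query the norm is 1 / (idf * boost), so the value
collapses to roughly idf, which is the cancellation mentioned above.
With several terms the norm no longer cancels any one term's idf
exactly.
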
> >
> > Am I mistaken, or is the documentation off?
> >
> > Thanks for your help,
> > Ken
> >
> > 
> > To unsubscribe, email: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, email: lucene-user-help@jakarta.apache.org
