Christoph,
I'd like to spend more time looking at this, but won't be able to until
tomorrow. It's a little confusing because explain() is not consistent
with the actual score() values: an additional normalization is applied
to bring the scores into [0,1], and explain() does not show it.
Looking at the code, I think the cancellations you've made below are
obscuring the fact that idf is also squared in single-Term scores, not
just when Terms occur in a BooleanQuery. At least that is consistent.
The inconsistency in the formulas troubled me when I first looked at it,
but it turns out it doesn't matter (so even if I'm wrong about the
single Term formula it doesn't matter). That's because idf is
irrelevant in a single term query, as it is a constant multiplier in all
results that is just normalized out.
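A minimal sketch of why the constant drops out, assuming the normalization
simply divides every raw score by the top score (that is an assumption for
illustration; the class and method names are hypothetical and the
per-document values are made up):

```java
import java.util.Arrays;

public class SingleTermNormalization {
    // Hypothetical stand-in for the normalization into [0,1]:
    // divide every raw score by the largest raw score.
    static float[] normalize(float[] raw) {
        float max = 0f;
        for (float s : raw) {
            max = Math.max(max, s);
        }
        float[] out = new float[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = max > 0f ? raw[i] / max : 0f;
        }
        return out;
    }

    // Multiply every per-document factor by the same constant c
    // (c plays the role of idf or idf^2 in a single-term query).
    static float[] scale(float[] perDoc, float c) {
        float[] out = new float[perDoc.length];
        for (int i = 0; i < perDoc.length; i++) {
            out[i] = perDoc[i] * c;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] perDoc = {4.0f, 2.0f, 1.0f}; // made-up tf * norm values
        float idf = 2.0f;                    // made-up idf
        float[] once = normalize(scale(perDoc, idf));
        float[] squared = normalize(scale(perDoc, idf * idf));
        System.out.println(Arrays.toString(once));    // prints [1.0, 0.5, 0.25]
        System.out.println(Arrays.toString(squared)); // prints [1.0, 0.5, 0.25]
    }
}
```

Whether idf enters once or twice, the normalized single-term results come
out the same, so the squaring only matters once several terms with
different idf values are combined.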
I think there are at least two bugs here:
1. idf should not be squared.
2. explain() should explain the actual reported score().
I would venture a guess that these bugs are historical artifacts. Does
anybody know if normalization was introduced into the code after the
original scoring mechanisms were written? The idf's need to be
considered for normalization to work properly, which could have led to
the inadvertent squaring.
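To make the suspected double multiplication concrete, here is a standalone
sketch. The method names mirror sumOfSquaredWeights() and normalize() from
Christoph's message, but the class itself is hypothetical and simplified;
it is not the actual Lucene source, only my reading of the two steps:

```java
public class TermWeightSketch {
    private final float idf;
    private final float boost;
    private float queryWeight;
    private float value;

    public TermWeightSketch(float idf, float boost) {
        this.idf = idf;
        this.boost = boost;
    }

    // Step 1: idf enters the query weight for the first time.
    public float sumOfSquaredWeights() {
        queryWeight = idf * boost;
        return queryWeight * queryWeight;
    }

    // Step 2: after applying queryNorm, idf is multiplied in a second time.
    public void normalize(float queryNorm) {
        queryWeight *= queryNorm;
        value = queryWeight * idf;
    }

    public float getValue() {
        return value;
    }

    public static void main(String[] args) {
        TermWeightSketch w = new TermWeightSketch(3.0f, 1.0f);
        w.sumOfSquaredWeights();
        // queryNorm fixed to 1 so that the idf factor stays visible
        w.normalize(1.0f);
        System.out.println(w.getValue()); // prints 9.0, i.e. idf^2
    }
}
```

With a single term, queryNorm would be 1 / idf (for a boost of 1), which
cancels one of the two idf factors. That is why the single-term formula
appears to contain idf only once.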
Chuck
> -----Original Message-----
> From: Christoph Goller [mailto:goller@detegosoftware.de]
> Sent: Wednesday, October 13, 2004 2:04 AM
> To: Lucene Developers List
> Subject: Search and Scoring
>
> > As an aside, is there a reason that idf is squared in each Term and
> > Phrase match (it is multiplied both into the query component and the
> > field component)? To compensate for this, I'm taking the square root
> > of the idf I really want in my Similarity, which seems strange.
>
> Hi Chuck,
>
> that's a very good question. And you are right, it may be a bug, I am
> not sure about it. I stumbled over this several times when studying
> code in the search package. It's a little bit difficult to explain,
> since the code for score computation is distributed over Weight and
> Scorer classes. It seems that a TermQuery and a PhraseQuery weight is
> multiplied by idf twice, first in sumOfSquaredWeights() and then in
> normalize(). That's what you discovered.
>
> The formula in the Similarity Javadoc does not describe the scoring
> completely. I will try to write down the formula that exactly describes
> the current implementation. Then we can start a discussion and people
> can decide whether this is the intended scoring. (I assume
> DefaultSimilarity here.)
>
> Let's start with the simple case. A pure TermQuery (one-word query)
> gets the following score after cancelling queryNorm(t) and
> queryBoost(t) (coord is 1 here):
>
> t: TermQuery
> d: document
>
> score(t, d) =
>   tf(t in d) * idf(t) * fieldBoost(t.field in d) *
>   lengthFieldNorm(t.field in d)
>
> Note that fieldBoost and lengthFieldNorm are both combined in the norms.
>
> For a BooleanQuery consisting of several TermQueries we get the
> following (again we can cancel queryBoost(q)):
>
> q: BooleanQuery
> t: Term and corresponding TermQuery
> d: document
>
> score(q, d) = coord(q, d) * queryNorm(q) *
>   SUM_{t in q} ( tf(t in d) * idf(t)^2 * queryBoost(t) *
>                  fieldBoost(t.field in d) *
>                  lengthFieldNorm(t.field in d) )
>
> where
> coord(q, d) = "fraction of TermQueries occurring in d"
> queryNorm(q) = 1 / SQRT( SUM_{t in q} ( (idf(t) * queryBoost(t))^2 ) )
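The BooleanQuery formula above can be sketched directly in code. This is
only a transcription of the formula as written, not Lucene code: the class
and method names are hypothetical, norm[i] stands for the combined
fieldBoost * lengthFieldNorm of term i, and the inputs are made-up values.

```java
public class BooleanScoreSketch {
    // queryNorm(q) = 1 / SQRT( SUM_{t in q} ( (idf(t) * queryBoost(t))^2 ) )
    static double queryNorm(double[] idf, double[] queryBoost) {
        double sum = 0.0;
        for (int i = 0; i < idf.length; i++) {
            double w = idf[i] * queryBoost[i];
            sum += w * w;
        }
        return 1.0 / Math.sqrt(sum);
    }

    // score(q, d) = coord * queryNorm(q) *
    //   SUM_{t in q} ( tf * idf^2 * queryBoost * norm )
    static double score(double coord, double[] tf, double[] idf,
                        double[] queryBoost, double[] norm) {
        double sum = 0.0;
        for (int i = 0; i < idf.length; i++) {
            sum += tf[i] * idf[i] * idf[i] * queryBoost[i] * norm[i];
        }
        return coord * queryNorm(idf, queryBoost) * sum;
    }

    public static void main(String[] args) {
        double[] tf = {1.0, 1.0};         // made-up term frequencies
        double[] idf = {3.0, 4.0};        // made-up idf values
        double[] queryBoost = {1.0, 1.0}; // no boosting
        double[] norm = {1.0, 1.0};       // fieldBoost * lengthFieldNorm
        System.out.println(queryNorm(idf, queryBoost));
        System.out.println(score(1.0, tf, idf, queryBoost, norm));
    }
}
```

With idf values 3 and 4, queryNorm is 1/5 and the sum is 9 + 16 = 25, so
the score is about 5. The term with idf 4 contributes 16/25 of it, where a
non-squared idf would give it only 4/7.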
>
> I hope this starts a discussion.
>
> Christoph
>
> 
> To unsubscribe, email: lucenedevunsubscribe@jakarta.apache.org
> For additional commands, email: lucenedevhelp@jakarta.apache.org

