> As an aside, is there a reason that idf is squared in each Term and
> Phrase match (it is multiplied both into the query component and the
> field component)? To compensate for this, I'm taking the square root of
> the idf I really want in my Similarity, which seems strange.
Hi Chuck,
That's a very good question. You may be right that it is a bug; I am
not sure about it. I have stumbled over this several times when studying
the code in the search package. It is a little difficult to explain because
the code for score computation is distributed over the Weight and Scorer
classes. It seems that the weight of a TermQuery or a PhraseQuery is
multiplied by idf twice, first in sumOfSquaredWeights() and then in
normalize(). That's what you discovered.
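To make the double multiplication concrete, here is a minimal sketch in Python of the two steps as I understand them. It is a simplified model, not Lucene's code: the class TermWeightSketch and its fields are made up for illustration, and only the two method names mirror the ones mentioned above.

```python
import math


class TermWeightSketch:
    """Simplified, hypothetical model of a term weight (not Lucene's class)."""

    def __init__(self, idf, boost=1.0):
        self.idf = idf
        self.boost = boost
        self.query_weight = 0.0
        self.value = 0.0

    def sum_of_squared_weights(self):
        # First use of idf: it goes into the query-side weight.
        self.query_weight = self.idf * self.boost
        return self.query_weight ** 2

    def normalize(self, query_norm):
        # Second use of idf: the normalized query weight is
        # multiplied by idf again to form the final weight value.
        self.query_weight *= query_norm
        self.value = self.query_weight * self.idf


# With queryNorm fixed to 1, the weight value is idf^2 * boost.
w = TermWeightSketch(idf=2.0, boost=4.0)
w.sum_of_squared_weights()
w.normalize(1.0)
print(w.value)  # 16.0

# For a single-term query, queryNorm = 1 / (idf * boost),
# so one idf and the boost cancel and only idf remains.
single = TermWeightSketch(idf=2.0, boost=4.0)
norm = 1.0 / math.sqrt(single.sum_of_squared_weights())
single.normalize(norm)
print(single.value)  # 2.0
```

The second case is why the squared idf is hidden for one-word queries and only shows up once several terms share one queryNorm.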
The formula in the Similarity Javadoc does not describe the scoring completely.
I will try to write down the formula that exactly describes the current
implementation. Then we can start a discussion and people can decide
whether this is the intended scoring. (I assume DefaultSimilarity here.)
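Let's start with the simple case. A pure TermQuery (a one-word query) gets
the following score after queryNorm(t) and queryBoost(t) cancel out.
For a single term, queryNorm(t) = 1 / (idf(t) * queryBoost(t)), so one
factor of idf and the boost drop out (coord is 1 here):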
t: TermQuery
d: document
score(t, d) =
tf(t in d) * idf(t) * fieldBoost(t.field in d) * lengthFieldNorm(t.field in d)
Note that fieldBoost and lengthFieldNorm are both combined in the norms.
For a BooleanQuery consisting of several TermQueries we get the following
(again, queryBoost(q) cancels out):
q: BooleanQuery
t: Term and corresponding TermQuery
d: document
score(q, d) = coord(q, d) * queryNorm(q) *
    SUM_{t in q} ( tf(t in d) * idf(t)^2 * queryBoost(t)
                   * fieldBoost(t.field in d)
                   * lengthFieldNorm(t.field in d) )
where
coord(q, d) = "fraction of TermQueries occurring in d"
queryNorm(q) = 1 / SQRT( SUM_{t in q} ( (idf(t) * queryBoost(t) )^2 ) )
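The formulas above can be checked with a small Python sketch. It only transcribes the formulas from this mail and does not call Lucene; the TermStats record and the function names are hypothetical, and tf is taken as the already computed tf value (0 if the term does not occur in d).

```python
import math
from dataclasses import dataclass


@dataclass
class TermStats:
    """Hypothetical per-term inputs for one document d."""
    tf: float                 # tf(t in d), 0.0 if t does not occur in d
    idf: float                # idf(t)
    query_boost: float = 1.0  # queryBoost(t)
    field_boost: float = 1.0  # fieldBoost(t.field in d)
    length_field_norm: float = 1.0  # lengthFieldNorm(t.field in d)


def query_norm(terms):
    # queryNorm(q) = 1 / SQRT( SUM (idf * queryBoost)^2 )
    return 1.0 / math.sqrt(sum((t.idf * t.query_boost) ** 2 for t in terms))


def coord(terms):
    # Fraction of the query's terms that occur in the document.
    return sum(1 for t in terms if t.tf > 0) / len(terms)


def boolean_score(terms):
    # coord * queryNorm * SUM( tf * idf^2 * queryBoost * fieldBoost * norm )
    total = sum(
        t.tf * t.idf ** 2 * t.query_boost * t.field_boost * t.length_field_norm
        for t in terms
    )
    return coord(terms) * query_norm(terms) * total


def single_term_score(t):
    # tf * idf * fieldBoost * lengthFieldNorm (boost and one idf cancelled)
    return t.tf * t.idf * t.field_boost * t.length_field_norm


# A one-term BooleanQuery reduces to the single TermQuery formula.
t = TermStats(tf=2.0, idf=2.0, query_boost=4.0, length_field_norm=0.5)
print(boolean_score([t]), single_term_score(t))  # 2.0 2.0
```

With two or more terms the shared queryNorm no longer cancels a full idf factor per term, so each term is effectively weighted by idf(t)^2.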
I hope this starts a discussion.
Christoph

