lucene-dev mailing list archives

From Christoph Goller <gol...@detego-software.de>
Subject Search and Scoring
Date Wed, 13 Oct 2004 09:03:35 GMT
> As an aside, is there a reason that idf is squared in each Term and
> Phrase match (it is multiplied both into the query component and the
> field component)?  To compensate for this, I'm taking the square root of
> the idf I really want in my Similarity, which seems strange.

Hi Chuck,

That's a very good question. You are right that it may be a bug, but I
am not sure. I stumbled over this several times when studying the code
in the search package. It is a bit difficult to explain because the code
for score computation is distributed over the Weight and Scorer
classes. It seems that the weight of a TermQuery or a PhraseQuery is
multiplied by idf twice, first in sumOfSquaredWeights() and then in
normalize(). That's what you discovered.
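
To make the double multiplication concrete, here is a toy model in Python of
the two steps as I read them. It is only a sketch, not Lucene's actual code:
the class ToyTermWeight and its fields are made-up names, and the numbers are
arbitrary.

```python
import math


class ToyTermWeight:
    """Toy stand-in for a term weight; hypothetical, not a Lucene class."""

    def __init__(self, idf, boost=1.0):
        self.idf = idf
        self.boost = boost
        self.query_weight = 0.0
        self.value = 0.0

    def sum_of_squared_weights(self):
        # First idf factor: the raw query weight is idf * boost.
        self.query_weight = self.idf * self.boost
        return self.query_weight ** 2

    def normalize(self, query_norm):
        # Second idf factor: the normalized query weight is
        # multiplied by idf again to give the final weight value.
        self.query_weight *= query_norm
        self.value = self.query_weight * self.idf


w = ToyTermWeight(idf=3.0, boost=2.0)
query_norm = 1.0 / math.sqrt(w.sum_of_squared_weights())
w.normalize(query_norm)
print(w.value)  # idf^2 * boost * queryNorm
```

For a one-term query the query norm is 1 / (idf * boost), so one idf factor
and the boost cancel and the final value is just idf.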

The formula in the Similarity Javadoc does not describe the scoring completely.
I will try to write down a formula that describes the current
implementation exactly. Then we can start a discussion and people can decide
whether this is the intended scoring. (I assume DefaultSimilarity here.)

Let's start with the simple case. A pure TermQuery (a one-word query) gets
the following score after cancelling queryNorm(t) and queryBoost(t)
(coord is 1 here):

t: TermQuery
d: document

score(t, d) =
  tf(t in d) * idf(t) * fieldBoost(t.field in d) * lengthFieldNorm(t.field in d)

Note that fieldBoost and lengthFieldNorm are both combined in the norms.
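
As a quick numeric check of the cancellation, here is a small Python sketch.
It compares the simplified single-term score with the uncancelled form,
which is the general BooleanQuery formula given further down applied to one
term. The function names and input values are made up for illustration.

```python
import math


def single_term_score(tf, idf, field_boost, length_field_norm):
    """score(t, d) for a pure TermQuery, after the cancellation."""
    return tf * idf * field_boost * length_field_norm


def uncancelled_score(tf, idf, query_boost, field_boost, length_field_norm):
    """Same score with queryNorm and queryBoost still written out."""
    # queryNorm for a one-term query: 1 / sqrt((idf * queryBoost)^2)
    query_norm = 1.0 / math.sqrt((idf * query_boost) ** 2)
    return (query_norm * tf * idf ** 2 * query_boost
            * field_boost * length_field_norm)


a = single_term_score(tf=2.0, idf=3.0, field_boost=1.0, length_field_norm=0.5)
b = uncancelled_score(tf=2.0, idf=3.0, query_boost=4.0,
                      field_boost=1.0, length_field_norm=0.5)
print(a, b)  # both forms should agree for any queryBoost
```

The query boost has no effect on the result here, which is why it can be
dropped from the single-term formula.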

For a BooleanQuery consisting of several TermQueries we get the following
(again, queryBoost(q) cancels out):

q: BooleanQuery
t: Term and corresponding TermQuery
d: document

score(q, d) = coord(q, d) * queryNorm(q) *
  SUM_{t in q} ( tf(t in d) * idf(t)^2 * queryBoost(t) * fieldBoost(t.field in d)
    * lengthFieldNorm(t.field in d) )

where
coord(q, d) = "fraction of TermQueries occurring in d"
queryNorm(q) = 1 / SQRT( SUM_{t in q} ( (idf(t) * queryBoost(t) )^2 ) )
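
The formula above can be written out as a short Python sketch, which may
help when checking it against the code. This is only my reading of the
formula, not Lucene code: the dictionary keys and function names are made
up, and a term counts as occurring in d when its tf is greater than zero.

```python
import math


def query_norm(terms):
    """1 / SQRT( SUM_{t in q} (idf(t) * queryBoost(t))^2 )"""
    return 1.0 / math.sqrt(
        sum((t["idf"] * t["query_boost"]) ** 2 for t in terms))


def boolean_score(terms):
    """score(q, d) for a BooleanQuery made of TermQueries."""
    matched = [t for t in terms if t["tf"] > 0]
    # coord: fraction of the query's terms that occur in the document
    coord = len(matched) / len(terms)
    total = sum(
        t["tf"] * t["idf"] ** 2 * t["query_boost"]
        * t["field_boost"] * t["length_field_norm"]
        for t in matched)
    return coord * query_norm(terms) * total


terms = [
    # occurs in the document
    {"tf": 2.0, "idf": 3.0, "query_boost": 1.0,
     "field_boost": 1.0, "length_field_norm": 0.5},
    # does not occur in the document (tf = 0)
    {"tf": 0.0, "idf": 4.0, "query_boost": 1.0,
     "field_boost": 1.0, "length_field_norm": 0.5},
]
score = boolean_score(terms)
print(score)
```

With these numbers coord is 0.5, queryNorm is 1/5, and the sum is 9, so the
score comes out at about 0.9. Note that idf enters the sum squared, while
queryNorm only divides by a single combined idf term.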

I hope this starts a discussion.

Christoph
