lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Contribution: better multi-field searching
Date Wed, 13 Oct 2004 16:24:37 GMT
Paul Elschot wrote:
>>Did you see my IDF question at the bottom of the original note?  I'm
>>really curious why the square of IDF is used for Term and Phrase
>>queries, rather than just IDF.  It seems like it might be a bug?
> 
> I missed that.
> It has been discussed recently, but I don't remember the outcome,
> perhaps someone else?

This has indeed been discussed before.

Lucene computes a dot-product of a query vector and each document 
vector.  Weights in both vectors are normalized tf*idf, i.e., 
(tf*idf)/length.  The dot product of vectors d and q is:

   score(d,q) =  sum over t of ( weight(t,q) * weight(t,d) )

Given this formulation, and the use of tf*idf weights, each component of 
the sum has an idf^2 factor.  That's just the way it works with dot 
products of tf*idf/length vectors.  It's not a bug.  If folks don't like 
it, they can simply override Similarity.idf() to return the square root 
of the superclass's value.
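
To make the idf^2 factor concrete, here is a rough standalone sketch of 
the dot product above.  It does not use Lucene and is not Lucene's 
scoring code: the class IdfSquaredSketch, its methods, and the toy 
numbers are all made up for illustration, and length is simply passed 
in as a number.

```java
import java.util.Map;

/** Hypothetical, Lucene-independent illustration of dot-product scoring. */
final class IdfSquaredSketch {

    /** Normalized tf*idf weight: (tf * idf) / length. */
    public static double weight(double tf, double idf, double length) {
        return tf * idf / length;
    }

    /**
     * score(d,q) = sum over t of ( weight(t,q) * weight(t,d) ).
     * queryTf and docTf map each term to its term frequency;
     * idf maps each term to its idf value (assumed precomputed).
     */
    public static double score(Map<String, Double> queryTf,
                               Map<String, Double> docTf,
                               Map<String, Double> idf,
                               double queryLength,
                               double docLength) {
        double sum = 0.0;
        for (Map.Entry<String, Double> entry : queryTf.entrySet()) {
            String term = entry.getKey();
            Double tfInDoc = docTf.get(term);
            if (tfInDoc == null) {
                continue; // a term missing from the document contributes nothing
            }
            double termIdf = idf.getOrDefault(term, 0.0);
            // idf appears once in each weight, so the product carries idf^2.
            sum += weight(entry.getValue(), termIdf, queryLength)
                 * weight(tfInDoc, termIdf, docLength);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> queryTf = Map.of("lucene", 1.0);
        Map<String, Double> docTf = Map.of("lucene", 2.0);

        // With idf = 4 the single term contributes 1 * 2 * 4^2.
        double squared = score(queryTf, docTf, Map.of("lucene", 4.0), 1.0, 1.0);

        // Using sqrt(idf) in both vectors leaves a single idf factor: 1 * 2 * 4.
        double linear = score(queryTf, docTf, Map.of("lucene", Math.sqrt(4.0)), 1.0, 1.0);

        System.out.println("idf in both vectors:       " + squared);
        System.out.println("sqrt(idf) in both vectors: " + linear);
    }
}
```

The second call mimics the effect of the sqrt override: with sqrt(idf) 
in both the query weight and the document weight, each element of the 
sum ends up with a single idf factor instead of idf^2.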

If someone can demonstrate that an alternate formulation produces 
superior results for most applications, then we should of course change 
the default implementation.  But just noting that there's a factor which 
is equal to idf^2 in each element of the sum does not do this.

Doug
