lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Trieschnigg, R.B. \(Dolf\)" <r.b.trieschn...@ewi.utwente.nl>
Subject RE: BM25 Similarity implementation
Date Fri, 17 Feb 2006 09:54:02 GMT
> > I would like to implement the Okapi BM25 weighting function 
> > using my own Similarity implementation. Unfortunately BM25 
> > requires the document length in the score calculation, which 
> > is not provided by the Scorer.
> 
> How do you want to measure document length?  If the number of 
> tokens is an acceptable measure, then the norm contains 
> sqrt(numTokens) by default.  You can modify your 
> Similarity.lengthNorm() implementation to not perform the 
> sqrt, or square the norm.

I assume the number of tokens will be a good estimate.

I've included an image with the algorithm (my ASCII art isn't that good).
Legend of the figure:
- k1, k3 and b are constants
- tf is the within document term frequency
- df is the document frequency
- N is the collection size
- r is the number of relevant documents containing a particular term (without relevance information
assumed to be 0)
- R is the number of items known to be relevant to a specific topic (without relevance information
assumed to be 0)

As far is I understand Lucene multiplies the squared weight with the result of Similarity.lengthNorm(),
but BM25 requires the document length for the calculation of the document term weighting (as
far as I know it's not possible to extract the influence of the normalization as a constant
multiplier).

Am I missing something here?

Dolf




Mime
View raw message