> > I would like to implement the Okapi BM25 weighting function
> > using my own Similarity implementation. Unfortunately BM25
> > requires the document length in the score calculation, which
> > is not provided by the Scorer.
>
> How do you want to measure document length? If the number of
> tokens is an acceptable measure, then the norm contains
> sqrt(numTokens) by default. You can modify your
> Similarity.lengthNorm() implementation to not perform the
> sqrt, or square the norm.
I assume the number of tokens will be a good estimate.
I've included an image with the algorithm (my ASCII art isn't that good).
Legend of the figure:
 k1, k3 and b are constants
 tf is the within document term frequency
 df is the document frequency
 N is the collection size
 r is the number of relevant documents containing a particular term (assumed to be 0 when no
relevance information is available)
 R is the number of documents known to be relevant to a specific topic (assumed to be 0 when no
relevance information is available)
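
Since the figure may not survive the mailing list, here is a minimal Java sketch of the usual Robertson/Sparck Jones form of BM25, which may differ in detail from the attached image. The class and method names are hypothetical and not part of Lucene. The parameters qtf (within-query term frequency), docLen and avgDocLen are my assumptions, since they do not appear in the legend, and the constants in main() are only illustrative values.

```java
/** Hypothetical sketch of a BM25 term weight; not part of Lucene. */
final class Bm25Sketch {

    /**
     * Robertson/Sparck Jones relevance weight.
     * With r = R = 0 it reduces to log((N - df + 0.5) / (df + 0.5)).
     */
    static double relevanceWeight(long n, long df, long r, long bigR) {
        double num = (r + 0.5) / (bigR - r + 0.5);
        double den = (df - r + 0.5) / (n - df - bigR + r + 0.5);
        return Math.log(num / den);
    }

    /**
     * Weight of one term in one document.
     * docLen and avgDocLen are measured in tokens (an assumption).
     */
    static double termWeight(double tf, double qtf, long n, long df, long r, long bigR,
                             double docLen, double avgDocLen,
                             double k1, double k3, double b) {
        // The document length enters here, inside the denominator of the tf part.
        double k = k1 * ((1.0 - b) + b * docLen / avgDocLen);
        double docPart = ((k1 + 1.0) * tf) / (k + tf);
        double queryPart = ((k3 + 1.0) * qtf) / (k3 + qtf);
        return relevanceWeight(n, df, r, bigR) * docPart * queryPart;
    }

    public static void main(String[] args) {
        // Illustrative constants only; tune them for your collection.
        double k1 = 1.2, k3 = 7.0, b = 0.75;
        double shortDoc = termWeight(3, 1, 1000, 10, 0, 0, 50, 100, k1, k3, b);
        double longDoc = termWeight(3, 1, 1000, 10, 0, 0, 400, 100, k1, k3, b);
        System.out.println("short document: " + shortDoc);
        System.out.println("long document:  " + longDoc);
    }
}
```

Note that the document length sits in the denominator next to tf, so the same tf yields a lower weight in a longer document.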
As far as I understand, Lucene multiplies the squared weight by the result of Similarity.lengthNorm(),
but BM25 needs the document length inside the calculation of the document term weight (as
far as I know, the influence of the normalization cannot be factored out as a constant
multiplier).
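
A short way to see the problem, assuming the usual form of the document part of BM25 (dl and avdl are my own names for the document length and the average document length; they are not in the legend):

```latex
w_{doc}(tf, dl) = \frac{(k_1 + 1)\, tf}{K(dl) + tf},
\qquad
K(dl) = k_1 \left( (1 - b) + b\, \frac{dl}{avdl} \right)

\frac{w_{doc}(tf, dl_1)}{w_{doc}(tf, dl_2)} = \frac{K(dl_2) + tf}{K(dl_1) + tf}
```

If the weight could be written as f(tf) * g(dl), this ratio would be g(dl_1) / g(dl_2) and would not depend on tf. For k1 > 0 and b > 0 it does depend on tf whenever dl_1 and dl_2 differ, so a per-document multiplier that is independent of tf cannot reproduce it exactly.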
Am I missing something here?
Dolf
