Sorry, the image wasn't sent:
http://wwwhome.cs.utwente.nl/~trieschn/bm25.PNG
> Original Message
> From: Trieschnigg, R.B. (Dolf)
> [mailto:r.b.trieschnigg@ewi.utwente.nl]
> Sent: vrijdag 17 februari 2006 10:54
> To: javauser@lucene.apache.org
> Subject: RE: BM25 Similarity implementation
>
> > > I would like to implement the Okapi BM25 weighting
> function using my
> > > own Similarity implementation. Unfortunately BM25 requires the
> > > document length in the score calculation, which is not
> provided by
> > > the Scorer.
> >
> > How do you want to measure document length? If the number
> of tokens
> > is an acceptable measure, then the norm contains
> > sqrt(numTokens) by default. You can modify your
> > Similarity.lengthNorm() implementation to not perform the sqrt, or
> > square the norm.
>
> I assume the number of tokens will be a good estimate.
>
> I've included an image with the algorithm (my ASCII art isn't
> that good).
> Legend of the figure:
>  k1, k3 and b are constants
>  tf is the within document term frequency
>  df is the document frequency
>  N is the collection size
>  r is the number of relevant documents containing a
> particular term (without relevance information assumed to be 0)
>  R is the number of items known to be relevant to a specific
> topic (without relevance information assumed to be 0)
>
> As far is I understand Lucene multiplies the squared weight
> with the result of Similarity.lengthNorm(), but BM25 requires
> the document length for the calculation of the document term
> weighting (as far as I know it's not possible to extract the
> influence of the normalization as a constant multiplier).
>
> Am I missing something here?
>
> Dolf

To unsubscribe, email: javauserunsubscribe@lucene.apache.org
For additional commands, email: javauserhelp@lucene.apache.org
