lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <>
Subject [jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
Date Mon, 12 Nov 2018 17:14:00 GMT


Adrien Grand commented on LUCENE-8563:

Agreed, [~softwaredoug]: I was assuming a single similarity. Removing (k1+1) would also
change ordering if other fields use different similarities.

> Remove k1+1 from the numerator of BM25Similarity
> -------------------------------------------------
>                 Key: LUCENE-8563
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify ordering. It
> is often omitted, and I found out that the paper "The Probabilistic Relevance Framework:
> BM25 and Beyond" by Robertson (BM25's author) and Zaragoza even describes adding (k1+1)
> to the numerator as a variant whose benefit is to be more comparable with
> Robertson/Sparck-Jones weighting, which we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
>  numerator of the saturation function. This is the same for all
>  terms, and therefore does not affect the ranking produced.
>  The reason for including it was to make the final formula
>  more compatible with the RSJ weight used on its own
> {quote}
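
To make the "does not affect the ranking" claim concrete, here is a minimal sketch in plain Java. It does not use Lucene; the class name, the sample (tf, norm) pairs and the value k1 = 1.2 are assumptions chosen for illustration. It scores the same documents with and without the (k1+1) factor under a single similarity and compares the resulting orderings.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration class, not part of Lucene.
class Bm25NumeratorSketch {
    static final double K1 = 1.2; // assumed value for the sketch

    // Form quoted in the issue: boost * IDF * (k1+1) * tf / (tf + norm)
    static double withK1(double boost, double idf, double tf, double norm) {
        return boost * idf * (K1 + 1) * tf / (tf + norm);
    }

    // Proposed form: the same expression without the (k1+1) constant.
    static double withoutK1(double boost, double idf, double tf, double norm) {
        return boost * idf * tf / (tf + norm);
    }

    // Returns document indices sorted by descending score.
    static Integer[] ranking(double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -scores[i]));
        return order;
    }

    public static void main(String[] args) {
        double boost = 1.0;
        double idf = 2.0;
        // Made-up {tf, norm} pairs for four documents matching one term.
        double[][] docs = {{1, 1.0}, {3, 2.5}, {2, 0.8}, {5, 6.0}};
        double[] scaled = new double[docs.length];
        double[] plain = new double[docs.length];
        for (int i = 0; i < docs.length; i++) {
            scaled[i] = withK1(boost, idf, docs[i][0], docs[i][1]);
            plain[i] = withoutK1(boost, idf, docs[i][0], docs[i][1]);
        }
        // Both lines should list the documents in the same order.
        System.out.println(Arrays.toString(ranking(scaled)));
        System.out.println(Arrays.toString(ranking(plain)));
    }
}
```

The two scores differ only by the constant factor (k1+1), so sorting by either gives the same order. This holds only when every clause uses the same similarity; once scores from other similarities or other sources are summed in, the rescaling changes their relative weight.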
> Should we remove it from BM25Similarity as well?
> A side effect that I'm interested in is that integrating other score contributions (e.g.
> via oal.document.FeatureField) would be a bit easier to reason about. For instance, a
> weight of 3 in FeatureField#newSaturationQuery would have an impact similar to that of a
> term whose IDF is 3 (and thus docFreq ~= 5%) rather than a term whose IDF is 3/(k1 + 1).
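
The arithmetic behind "IDF is 3 (and thus docFreq ~= 5%)" can be checked with a small sketch. It assumes the IDF form ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)), an index of one million documents, and k1 = 1.2; the class and method names are hypothetical and nothing here calls Lucene.

```java
// Hypothetical illustration class, not part of Lucene.
class IdfWeightSketch {
    // Assumed IDF form: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Inverts the formula above: the fraction of documents that must
    // contain a term for it to have the given IDF.
    static double docFreqFraction(double idf, long docCount) {
        double e = Math.exp(idf);
        return (docCount + 1 - 0.5 * e) / e / docCount;
    }

    public static void main(String[] args) {
        long docCount = 1_000_000L; // assumed index size
        double k1 = 1.2;            // assumed value for the sketch
        double weight = 3.0;        // the feature weight from the example

        // Without (k1+1): weight 3 is comparable to a term with IDF 3.
        System.out.printf("IDF %.3f -> docFreq fraction %.4f%n",
            weight, docFreqFraction(weight, docCount));

        // With (k1+1): the comparable term has IDF 3 / (k1 + 1).
        double reduced = weight / (k1 + 1);
        System.out.printf("IDF %.3f -> docFreq fraction %.4f%n",
            reduced, docFreqFraction(reduced, docCount));
    }
}
```

Under these assumptions, an IDF of 3 corresponds to a term found in about 5% of documents, while an IDF of 3/(k1+1) corresponds to a much more common term, found in roughly a quarter of documents. That gap is why the feature weight is harder to interpret while the factor stays in the numerator.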
