lucene-dev mailing list archives

From "Andrzej Bialecki (Commented) (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3934) Residual IDF calculation in the pruning package is wrong
Date Thu, 29 Mar 2012 18:22:22 GMT


Andrzej Bialecki  commented on LUCENE-3934:

Eh, it's even worse - the paper that we used as a reference is buggy itself :) or at least
misleading.

Formula 1, which supposedly gives the Robertson-Sparck-Jones normalization of IDF, should really
read (according to its authors):
IDF = log ( ((D - df) + 0.5) / (df + 0.5) )

  or: IDF = - log ( (df + 0.5) / ((D - df) + 0.5) )
As it's presented in the Blanco-Barreiro paper it would be invalid (for some values the argument
to log() would be negative).
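
As a rough illustration of the corrected Formula 1, here is a minimal Java sketch. The class and
method names (RsjIdf, idf) are made up for this example and are not part of the pruning package;
numDocs stands for D and docFreq for df, and the natural logarithm is an assumption.

```java
/** Hypothetical helper, not part of Lucene: Robertson-Sparck-Jones IDF as quoted above. */
class RsjIdf {
    /**
     * IDF = log(((D - df) + 0.5) / (df + 0.5)), natural logarithm.
     * numDocs is D (documents in the corpus), docFreq is df (documents containing the term).
     */
    static double idf(long numDocs, long docFreq) {
        if (docFreq < 0 || docFreq > numDocs) {
            throw new IllegalArgumentException("need 0 <= docFreq <= numDocs");
        }
        // With 0 <= df <= D, numerator and denominator are both >= 0.5,
        // so the argument to log() is always positive.
        return Math.log(((numDocs - docFreq) + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        System.out.println(idf(1000, 10));   // rare term: positive value
        System.out.println(idf(1000, 900));  // term in most documents: negative value
    }
}
```

Note that the value itself can still be negative (when df is more than half of D); it is only the
argument to log() that stays positive in this form.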

At this point I wasn't sure about Formula 2 in Blanco-Barreiro. Going by the definition, RIDF
should be the difference between the observed IDF (that is, the one calculated in Formula 1)
and an expected estimate based on a Poisson model, denoted as expIDF, whereas Formula 2 seemed
different. After searching the literature for a while I found another paper, by Murray and
Renals, where a formula for RIDF is presented clearly enough for math-challenged people like me:
expIDF = - log ( 1 - e^(-totalFreq/D) )
So, putting this together with the corrected Formula 1, Formula 2 in the Blanco-Barreiro paper
should look something like this:
RIDF = log(((D - df) + 0.5) / (df + 0.5)) + log( 1 - e^(-totalFreq/D) )

   or: RIDF = -log((df + 0.5) / ((D - df) + 0.5)) + log( 1 - e^(-totalFreq/D) )

Now, comparing to the original formula from the Blanco-Barreiro paper we can clearly see that
it is similar, but it differs in the way it calculates IDF:
RIDF = - log(df/D) + log(1 - e^(-totalFreq/D))       (Formula 2)
Which means that even though they mention the Robertson-Sparck-Jones normalization they don't
use it (and neither do Murray and Renals in their paper).
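
To make Formula 2 concrete, here is a small Java sketch of it. This is only an illustration under
my own naming (ResidualIdf, observedIdf, expectedIdf and ridf are hypothetical, not the pruning
package's API); numDocs is D, docFreq is df, and totalFreq is the number of occurrences of the
term in the whole corpus.

```java
/** Hypothetical helper, not part of Lucene: residual IDF per Formula 2 above. */
class ResidualIdf {
    /** Observed IDF: -log(df / D). */
    static double observedIdf(long numDocs, long docFreq) {
        return -Math.log((double) docFreq / numDocs);
    }

    /** Expected IDF under a Poisson model: -log(1 - e^(-totalFreq / D)). */
    static double expectedIdf(long numDocs, long totalFreq) {
        return -Math.log(1.0 - Math.exp(-(double) totalFreq / numDocs));
    }

    /** RIDF = observed IDF - expected IDF = -log(df/D) + log(1 - e^(-totalFreq/D)). */
    static double ridf(long numDocs, long docFreq, long totalFreq) {
        if (docFreq <= 0 || docFreq > numDocs || totalFreq < docFreq) {
            throw new IllegalArgumentException("need 0 < docFreq <= numDocs and totalFreq >= docFreq");
        }
        return observedIdf(numDocs, docFreq) - expectedIdf(numDocs, totalFreq);
    }

    public static void main(String[] args) {
        // Same document frequency, different total frequency.
        System.out.println(ridf(1000, 10, 10));   // one occurrence per document: close to 0
        System.out.println(ridf(1000, 10, 100));  // occurrences clustered in few documents: larger
    }
}
```

A term that occurs about once in each document that contains it ends up near zero, while a term
whose occurrences are concentrated in a few documents gets a clearly positive RIDF.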

To summarize, I think Formula 2 is correct, and our code has to be fixed. A patch is coming
shortly; I need to write a unit test.
> Residual IDF calculation in the pruning package is wrong
> --------------------------------------------------------
>                 Key: LUCENE-3934
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 3.6
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
> As discussed on the mailing list, there seems to be a bug in both the formula and the way
> RIDF is calculated. The formula is missing a minus sign, and the calculation uses the local
> (in-document) term frequency instead of the total term frequency (the sum of all term
> occurrences in the corpus).
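
The difference between the two frequencies in the description can be sketched in plain Java. This
is not Lucene code: the per-document frequency array and the names (TotalTermFreq, totalFreq,
docFreq) are assumptions made up for the illustration.

```java
/** Hypothetical illustration, not Lucene code: local vs. total term frequency. */
class TotalTermFreq {
    /** Total term frequency: the sum of a term's per-document frequencies over the corpus. */
    static long totalFreq(int[] perDocFreqs) {
        long sum = 0;
        for (int tf : perDocFreqs) {
            sum += tf;
        }
        return sum;
    }

    /** Document frequency: the number of documents in which the term occurs at least once. */
    static long docFreq(int[] perDocFreqs) {
        long count = 0;
        for (int tf : perDocFreqs) {
            if (tf > 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Frequency of one term in each of five documents (made-up numbers).
        int[] perDocFreqs = {3, 0, 1, 0, 5};
        System.out.println("local tf in doc 0: " + perDocFreqs[0]);
        System.out.println("total tf in corpus: " + totalFreq(perDocFreqs));
        System.out.println("doc freq: " + docFreq(perDocFreqs));
    }
}
```

The local value changes from document to document, while the total is a single corpus-wide
number per term, which is what the Poisson estimate in the RIDF formula expects.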
