[ https://issues.apache.org/jira/browse/LUCENE-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241475#comment-13241475
]
Andrzej Bialecki commented on LUCENE-3934:

Eh, it's even worse - the [http://www.dc.fi.udc.es/~barreiro/publications/blanco_barreiro_ecir2007.pdf|paper]
that we used as a reference is buggy itself :) or at least misleading.
Formula 1, which supposedly gives the Robertson-Sparck Jones normalization of idf, should really
read (according to [http://terrierteam.dcs.gla.ac.uk/publications/rtlo_DIRpaper.pdf|its authors]):
{code}
IDF = log ( ((D - df) + 0.5) / (df + 0.5) )
or: IDF = - log ( (df + 0.5) / ((D - df) + 0.5) )
{code}
As it's presented in the Blanco-Barreiro paper it would be invalid (for some values the argument
to log() would be negative).
At this point I wasn't sure about Formula 2 in Blanco-Barreiro, because going by the definition
it should be the difference between the observed IDF - that is, the one that is calculated in
Formula 1 - and an expected estimate based on a Poisson model, denoted as expIDF. Whereas
Formula 2 seemed different... After searching the literature for a while I found [http://www.cstr.ed.ac.uk/downloads/publications/2007/48920155.pdf|another paper]
by Murray-Renals where a formula for RIDF is presented clearly enough for math-challenged
people like me:
{code}
expIDF = - log ( 1 - e^(-totalFreq/D) )
RIDF = IDF - expIDF
{code}
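To make the two definitions concrete, here is a minimal Python sketch of expIDF = -log(1 - e^(-totalFreq/D)) and RIDF = IDF - expIDF. The function names and the sample counts are made up for illustration, and the observed IDF in the example is simply taken as -log(df/D); this is not the code from the pruning package.

```python
import math


def expected_idf(total_freq: int, num_docs: int) -> float:
    """Expected IDF under a Poisson model: -log(1 - e^(-totalFreq/D))."""
    # total_freq is the number of occurrences of the term in the whole
    # corpus, not the term frequency inside a single document.
    return -math.log(1.0 - math.exp(-total_freq / num_docs))


def residual_idf(observed_idf: float, total_freq: int, num_docs: int) -> float:
    """RIDF = IDF - expIDF."""
    return observed_idf - expected_idf(total_freq, num_docs)


# Hypothetical counts: 100 documents, the term appears in 10 of them,
# 50 times in total (so it is "bursty": several hits per matching doc).
D, df, total_freq = 100, 10, 50
observed = -math.log(df / D)  # assumption: plain IDF as the observed value
print(residual_idf(observed, total_freq, D))
```

A term whose occurrences are spread one per document (totalFreq close to df) ends up with an RIDF near zero, while a term concentrated in few documents gets a clearly positive value.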
So, to summarize, the Formula 2 in the Blanco-Barreiro paper should look something like this:
{code}
RIDF = log(((D - df) + 0.5) / (df + 0.5)) + log( 1 - e^(-totalFreq/D) )
or: RIDF = -log((df + 0.5) / ((D - df) + 0.5)) + log( 1 - e^(-totalFreq/D) )
{code}
Now, comparing to the original formula from the Blanco-Barreiro paper we can clearly see that
it is similar, but it differs in the way it calculates IDF:
{code}
RIDF = - log(df/D) + log(1 - e^(-totalFreq/D)) (Formula 2)
{code}
Which means that even though they mention the Robertson-Sparck Jones normalization they don't
use it (and neither do Murray and Renals in their paper).
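The difference between the two IDF variants can be seen in a small Python sketch. It compares RIDF built on the plain IDF, -log(df/D), with RIDF built on the Robertson-Sparck Jones form, log(((D - df) + 0.5) / (df + 0.5)). All function names and counts here are hypothetical and only serve the comparison.

```python
import math


def plain_idf(df: int, num_docs: int) -> float:
    """Plain IDF: -log(df/D)."""
    return -math.log(df / num_docs)


def rsj_idf(df: int, num_docs: int) -> float:
    """Robertson-Sparck Jones form: log(((D - df) + 0.5) / (df + 0.5))."""
    return math.log(((num_docs - df) + 0.5) / (df + 0.5))


def poisson_term(total_freq: int, num_docs: int) -> float:
    """The shared second term: log(1 - e^(-totalFreq/D)), i.e. -expIDF."""
    return math.log(1.0 - math.exp(-total_freq / num_docs))


def ridf_plain(df: int, total_freq: int, num_docs: int) -> float:
    return plain_idf(df, num_docs) + poisson_term(total_freq, num_docs)


def ridf_rsj(df: int, total_freq: int, num_docs: int) -> float:
    return rsj_idf(df, num_docs) + poisson_term(total_freq, num_docs)


D = 1000  # made-up corpus size
for df, total_freq in [(5, 40), (200, 900), (800, 5000)]:
    print(df, total_freq,
          ridf_plain(df, total_freq, D),
          ridf_rsj(df, total_freq, D))
```

For rare terms the two variants stay close, but once df exceeds roughly half of D the Robertson-Sparck Jones IDF turns negative while the plain IDF stays positive, so the choice does change the resulting RIDF for frequent terms.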
To summarize, I think Formula 2 is correct, and our code has to be fixed. A patch is coming
shortly; I need to write a unit test first.
> Residual IDF calculation in the pruning package is wrong
> 
>
> Key: LUCENE-3934
> URL: https://issues.apache.org/jira/browse/LUCENE-3934
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 3.5, 3.6
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
>
> As discussed on the mailing list (http://markmail.org/message/cwnyfqmet3wophec) there
seems to be a bug in both the formula and in the way RIDF is calculated. The formula is missing
a minus, but also the calculation uses local (in-document) term frequency instead of the total
term frequency (sum of all term occurrences in a corpus).

