lucene-dev mailing list archives

From "Andrzej Bialecki (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3934) Residual IDF calculation in the pruning package is wrong
Date Thu, 29 Mar 2012 18:22:22 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241475#comment-13241475 ]

Andrzej Bialecki  commented on LUCENE-3934:
-------------------------------------------

Eh, it's even worse - the [http://www.dc.fi.udc.es/~barreiro/publications/blanco_barreiro_ecir2007.pdf|paper]
that we used as a reference is buggy itself :) or at least misleading.

Formula 1, which supposedly gives the Robertson-Sparck-Jones normalization of idf, should really
read (according to [http://terrierteam.dcs.gla.ac.uk/publications/rtlo_DIRpaper.pdf|its authors]):
{code}
IDF = log ( ((D - df) + 0.5) / (df + 0.5) )

  or: IDF = - log ( (df + 0.5) / ((D - df) + 0.5) )
{code}
As it's presented in the Blanco-Barreiro paper it would be invalid (for some values the argument
to log() would be negative).
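
For illustration, here is a minimal Java sketch of the corrected Formula 1. The class and method names are made up for this example and are not taken from the pruning package, and the natural log is an assumption (another base only rescales the result). It shows that with the +0.5 terms the argument to log() stays positive for any 0 <= df <= D, although the IDF itself goes negative once a term occurs in more than half of the documents.

```java
/** Illustrative sketch of the Robertson-Sparck-Jones IDF; names are hypothetical. */
final class RsjIdf {
    private RsjIdf() {}

    /**
     * @param numDocs D, the number of documents in the corpus
     * @param docFreq df, the number of documents that contain the term
     * @return log(((D - df) + 0.5) / (df + 0.5)), natural log assumed
     */
    static double idf(long numDocs, long docFreq) {
        if (docFreq < 0 || docFreq > numDocs) {
            throw new IllegalArgumentException("expected 0 <= df <= D");
        }
        // Numerator and denominator are both >= 0.5, so log() is always defined.
        return Math.log(((numDocs - docFreq) + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        System.out.println(idf(1000, 10));  // rare term: positive value
        System.out.println(idf(1000, 900)); // very common term: negative value
    }
}
```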

At this point I wasn't sure about Formula 2 in Blanco-Barreiro. Going by the definition, RIDF
should be the difference between the observed IDF (that is, the one calculated in Formula 1)
and an expected estimate based on a Poisson model, denoted as expIDF, whereas Formula 2 seemed
different... After searching the literature for a while I found [http://www.cstr.ed.ac.uk/downloads/publications/2007/48920155.pdf|another paper]
by Murray-Renals where a formula for RIDF is presented clearly enough for math-challenged
people like me:
{code}
expIDF = - log ( 1 - e^(-totalFreq/D) )
RIDF = IDF - expIDF
{code}
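
The two lines above can be sketched in Java roughly as follows. This is only an illustration: the class and method names are hypothetical, the natural log is an assumption, and the observed IDF is passed in as a plain number so the sketch does not commit to any particular IDF variant. Under a Poisson model with rate totalFreq/D, the term 1 - e^(-totalFreq/D) is the probability that a document contains the term at least once.

```java
/** Illustrative sketch of the Poisson-based expected IDF; names are hypothetical. */
final class PoissonRidf {
    private PoissonRidf() {}

    /**
     * @param totalFreq total number of occurrences of the term in the whole corpus
     * @param numDocs   D, the number of documents in the corpus
     * @return expIDF = -log(1 - e^(-totalFreq/D)), natural log assumed
     */
    static double expectedIdf(long totalFreq, long numDocs) {
        if (totalFreq <= 0 || numDocs <= 0) {
            throw new IllegalArgumentException("expected totalFreq > 0 and D > 0");
        }
        double lambda = (double) totalFreq / numDocs;
        // -expm1(-lambda) equals 1 - e^(-lambda), computed accurately for small lambda.
        return -Math.log(-Math.expm1(-lambda));
    }

    /** RIDF = IDF - expIDF, for whatever observed IDF the caller supplies. */
    static double ridf(double observedIdf, long totalFreq, long numDocs) {
        return observedIdf - expectedIdf(totalFreq, numDocs);
    }

    public static void main(String[] args) {
        System.out.println(expectedIdf(1000, 1000));
        System.out.println(ridf(2.0, 1000, 1000));
    }
}
```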
So, plugging in the IDF from the corrected Formula 1, Formula 2 in the Blanco-Barreiro paper should look something like this:
{code}
RIDF = log(((D - df) + 0.5) / (df + 0.5)) + log( 1 - e^(-totalFreq/D) )

   or: RIDF = -log((df + 0.5) / ((D - df) + 0.5)) + log( 1 - e^(-totalFreq/D) )

{code}
Now, comparing to the original formula from the Blanco-Barreiro paper we can clearly see that
it is similar, but it differs in the way it calculates IDF:
{code}
RIDF = - log(df/D) + log(1 - e^(-totalFreq/D))       (Formula 2)
{code}
This means that even though they mention the Robertson-Sparck-Jones normalization, they don't
use it (and neither do Murray and Renals in their paper).
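
As a rough sketch, Formula 2 as printed in the paper could be written in Java like this. The names are hypothetical and do not come from the pruning package, and the natural log is assumed. Note that totalFreq here is the corpus-wide sum of term occurrences, not the in-document term frequency.

```java
/** Illustrative sketch of Formula 2 (RIDF with plain -log(df/D)); names are hypothetical. */
final class Formula2Ridf {
    private Formula2Ridf() {}

    /**
     * @param docFreq   df, the number of documents that contain the term
     * @param totalFreq total number of occurrences of the term in the whole corpus
     * @param numDocs   D, the number of documents in the corpus
     * @return -log(df/D) + log(1 - e^(-totalFreq/D)), natural log assumed
     */
    static double ridf(long docFreq, long totalFreq, long numDocs) {
        if (docFreq <= 0 || docFreq > numDocs || totalFreq < docFreq) {
            throw new IllegalArgumentException("expected 0 < df <= D and totalFreq >= df");
        }
        double observedIdf = -Math.log((double) docFreq / numDocs);
        // -expm1(-x) equals 1 - e^(-x); negating the log gives the Poisson-based expIDF.
        double expectedIdf = -Math.log(-Math.expm1(-(double) totalFreq / numDocs));
        return observedIdf - expectedIdf;
    }

    public static void main(String[] args) {
        // Same document frequency, different total frequency.
        System.out.println(ridf(10, 10, 1000));
        System.out.println(ridf(10, 100, 1000));
    }
}
```

With df held fixed, the value grows with totalFreq, so a term whose occurrences are concentrated in a few documents scores higher than one spread evenly across them.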

To summarize, I think Formula 2 is correct as published, and it is our code that has to be fixed.
A patch is coming shortly; I still need to write a unit test.
                
> Residual IDF calculation in the pruning package is wrong
> --------------------------------------------------------
>
>                 Key: LUCENE-3934
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3934
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 3.6
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>
> As discussed on the mailing list (http://markmail.org/message/cwnyfqmet3wophec) there
> seems to be a bug in both the formula and in the way RIDF is calculated. The formula is missing
> a minus, but also the calculation uses local (in-document) term frequency instead of the total
> term frequency (sum of all term occurrences in a corpus).


        
