Date: Thu, 29 Mar 2012 18:22:22 +0000 (UTC)
From: "Andrzej Bialecki (Issue Comment Edited) (JIRA)"
To: dev@lucene.apache.org
Message-ID: <1128805910.34041.1333045342363.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <146650398.31957.1333014388721.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Issue Comment Edited] (LUCENE-3934) Residual IDF calculation in the pruning package is wrong

[
https://issues.apache.org/jira/browse/LUCENE-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241475#comment-13241475 ]

Andrzej Bialecki edited comment on LUCENE-3934 at 3/29/12 6:21 PM:
--------------------------------------------------------------------

Eh, it's even worse - the [paper|http://www.dc.fi.udc.es/~barreiro/publications/blanco_barreiro_ecir2007.pdf] that we used as a reference is buggy itself :) or at least misleading. Formula 1, which supposedly gives the Robertson-Sparck-Jones normalization of idf, should really read (according to [its authors|http://terrierteam.dcs.gla.ac.uk/publications/rtlo_DIRpaper.pdf]):

{code}
IDF = log ( ((D - df) + 0.5) / (df + 0.5) )
or:
IDF = - log ( (df + 0.5) / ((D - df) + 0.5) )
{code}

As it's presented in the Blanco-Barreiro paper it would be invalid (for some values the argument to log() would be negative).

At this point I wasn't sure about Formula 2 in Blanco-Barreiro, because going by the definition it should be the difference between the observed IDF - that is, the one calculated in Formula 1 - and an expected estimate based on a Poisson model, denoted as expIDF. Formula 2, however, seemed different...
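To make the corrected Formula 1 concrete, here is a minimal Python sketch of the two equivalent forms given above. The function names, the range check, and the choice of the natural logarithm are assumptions made for illustration; the comment does not state a log base, and this is not the code from the pruning package.

```python
import math


def rsj_idf(num_docs: int, doc_freq: int) -> float:
    """IDF = log(((D - df) + 0.5) / (df + 0.5)).

    num_docs is D, doc_freq is df. The natural log is an assumption;
    the comment does not state a base.
    """
    if doc_freq < 0 or doc_freq > num_docs:
        raise ValueError("doc_freq must lie in [0, num_docs]")
    # Both numerator and denominator are at least 0.5 here,
    # so the argument to log() is always positive.
    return math.log(((num_docs - doc_freq) + 0.5) / (doc_freq + 0.5))


def rsj_idf_negated(num_docs: int, doc_freq: int) -> float:
    """The equivalent form: IDF = -log((df + 0.5) / ((D - df) + 0.5))."""
    if doc_freq < 0 or doc_freq > num_docs:
        raise ValueError("doc_freq must lie in [0, num_docs]")
    return -math.log((doc_freq + 0.5) / ((num_docs - doc_freq) + 0.5))
```

Note that with this normalization the value itself becomes negative once a term occurs in more than half of the documents, even though the argument to log() stays positive.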
After searching the literature for a while I found [another paper|http://www.cstr.ed.ac.uk/downloads/publications/2007/48920155.pdf] by Murray-Renals where a formula for RIDF is presented clearly enough for math-challenged people like me:

{code}
expIDF = - log ( 1 - e^(-totalFreq/D) )
RIDF = IDF - expIDF
{code}

So, if the IDF from Formula 1 were plugged in, Formula 2 in the Blanco-Barreiro paper would look something like this:

{code}
RIDF = log(((D - df) + 0.5) / (df + 0.5)) + log( 1 - e^(-totalFreq/D) )
or:
RIDF = -log((df + 0.5) / ((D - df) + 0.5)) + log( 1 - e^(-totalFreq/D) )
{code}

Now, comparing to the original formula from the Blanco-Barreiro paper we can clearly see that it is similar, but it differs in the way it calculates IDF:

{code}
RIDF = - log(df/D) + log(1 - e^(-totalFreq/D))     (Formula 2)
{code}

Which means that even though they mention the Robertson-Sparck-Jones normalization they don't use it (and neither do Murray and Renals in their paper).

To summarize, I think Formula 2 is correct, and our code has to be fixed. Patch is coming shortly, I need to write a unit test.

(Edit: links were broken)

> Residual IDF calculation in the pruning package is wrong
> --------------------------------------------------------
>
>                 Key: LUCENE-3934
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3934
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 3.6
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>
> As discussed on the mailing list (http://markmail.org/message/cwnyfqmet3wophec) there seems to be a bug in both the formula and in the way RIDF is calculated.
> The formula is missing a minus, but also the calculation uses local (in-document) term frequency instead of the total term frequency (sum of all term occurrences in a corpus).
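As an illustration of the RIDF calculation discussed in this issue, here is a minimal Python sketch of Formula 2, RIDF = -log(df/D) + log(1 - e^(-totalFreq/D)), using the total term frequency over the whole corpus rather than an in-document count. The function and parameter names, the input checks, and the natural log are assumptions for illustration only; this is not the patch for the pruning package.

```python
import math


def expected_idf(total_freq: int, num_docs: int) -> float:
    """expIDF = -log(1 - e^(-totalFreq/D)), the Poisson-based estimate.

    total_freq is the sum of all occurrences of the term in the corpus.
    -expm1(-x) equals 1 - e^(-x) and is more accurate for small x.
    """
    return -math.log(-math.expm1(-total_freq / num_docs))


def observed_idf(doc_freq: int, num_docs: int) -> float:
    """IDF = -log(df/D), the plain form used in Formula 2."""
    return -math.log(doc_freq / num_docs)


def residual_idf(doc_freq: int, total_freq: int, num_docs: int) -> float:
    """RIDF = IDF - expIDF."""
    if doc_freq <= 0 or doc_freq > num_docs:
        raise ValueError("doc_freq must lie in (0, num_docs]")
    if total_freq < doc_freq:
        # A term occurs at least once in every document that contains it.
        raise ValueError("total_freq cannot be smaller than doc_freq")
    return observed_idf(doc_freq, num_docs) - expected_idf(total_freq, num_docs)
```

For a fixed document frequency, a term whose occurrences are concentrated in few documents (total_freq much larger than doc_freq) gets a higher RIDF than one that occurs once per document. For example, residual_idf(10, 100, 1000) is larger than residual_idf(10, 10, 1000).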