lucene-java-user mailing list archives

From Andrzej Bialecki
Subject Re: delete entries from posting list Lucene 4.0
Date Thu, 29 Mar 2012 09:14:48 GMT
On 27/03/2012 20:25, Zeynep P. wrote:
> While using the pruning package, I realised that ridf is calculated in
> RIDFTermPruningPolicy as follows:
> Math.log(1 - Math.pow(Math.E, termPositions.freq() / maxDoc)) - df
> However, according to the original paper (Blanco et al.) for residual idf,
> it should be -log(df/D) + log (1 - e^(*-*tf/D)). Thus, in the equation,
> Math.pow should be Math.pow(Math.E, - (termPositions.freq() / maxDoc))
> Do I miss something in the calculation or is this a bug?

Hmm, good question! After checking the original paper again, and then 
checking our implementation, I think that this is indeed a bug, and we 
should add the minus there. However, this formula may be completely 
broken either way. The paper that you mention says:

"Residual idf is defined in [3] as the difference between the observed 
idf (IDF) and the idf expected under the assumption that the terms 
follow an independence model, such as Poisson (IDF^). [...] If tf is the 
total number of tokens for a term t, then the ridf devised by a Poisson 
distribution is

RIDF = IDF - IDF^ = -log(df/D) + log(1 - e^(-tf/D))	[2]"

Since the purpose of the RIDF metric is to select informative words 
collection-wide, and not per-document, then it makes sense that they use 
a collection-wide metric like IDF as a baseline vs. another 
collection-wide metric based on total term frequency, or rather the 
total number of term occurrences in a collection.
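
On the sign question, a quick check may make the problem concrete. The snippet below is only a sketch: the class and method names are made up for illustration and are not part of the pruning package. It evaluates the expected-idf term log(1 - e^(-tf/D)) once as the paper writes it and once with the minus dropped. Without the minus, e^(tf/D) is greater than 1 for any positive tf, so the argument of the logarithm is negative.

```java
// Illustrative only: names here are hypothetical, not Lucene API.
final class RidfSignCheck {

    // Term as written in the paper: log(1 - e^(-tf/D)).
    static double withMinus(double tf, double numDocs) {
        return Math.log(1.0 - Math.exp(-tf / numDocs));
    }

    // Same term with the minus sign missing: log(1 - e^(tf/D)).
    // For tf > 0 the argument is negative, and Math.log returns NaN.
    static double withoutMinus(double tf, double numDocs) {
        return Math.log(1.0 - Math.exp(tf / numDocs));
    }

    public static void main(String[] args) {
        double tf = 100;       // made-up total occurrences of a term
        double numDocs = 1000; // made-up collection size
        System.out.println("with minus:    " + withMinus(tf, numDocs));
        System.out.println("without minus: " + withoutMinus(tf, numDocs));
    }
}
```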

The problem in our implementation is that we use a within-document term 
frequency (the number of occurrences of t in the current document) and 
not a collection-wide term frequency... so, it looks to me that the fix 
would be to first fully traverse the doc enumeration and calculate the 
total number of term occurrences in all documents (e.g. in 
RIDFTermPruningPolicy.initPositionsTerm(..) ), and use this value in the 
formula in place of termPositions.freq().
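
To illustrate the idea, here is a minimal sketch in plain Java. It is not the actual patch: the class, the method names and the int[] of per-document frequencies are stand-ins I made up for whatever the doc enumeration yields. It first sums the within-document frequencies into a collection-wide count and then plugs that count into equation [2].

```java
// Sketch under stated assumptions; nothing here is Lucene API.
final class RidfSketch {

    // Sum within-document frequencies to get the collection-wide
    // number of occurrences of the term (tf in the paper's notation).
    static long totalTermFreq(int[] perDocFreqs) {
        long total = 0;
        for (int f : perDocFreqs) {
            total += f;
        }
        return total;
    }

    // RIDF = -log(df/D) + log(1 - e^(-tf/D)), with tf collection-wide.
    // Casts to double avoid integer division; -expm1(-x) equals 1 - e^(-x)
    // and stays accurate when tf/D is very small.
    static double ridf(int df, long totalTf, int numDocs) {
        double idf = -Math.log((double) df / numDocs);
        double expectedIdf = -Math.log(-Math.expm1(-(double) totalTf / numDocs));
        return idf - expectedIdf;
    }

    public static void main(String[] args) {
        int[] perDocFreqs = {5, 1, 12, 2}; // made-up freqs in 4 matching docs
        int numDocs = 1000;                // made-up collection size
        long tf = totalTermFreq(perDocFreqs);
        System.out.println("tf = " + tf + ", ridf = "
                + ridf(perDocFreqs.length, tf, numDocs));
    }
}
```

With this form, a term that occurs about once in each document that contains it gets an RIDF close to zero, while a term whose occurrences are concentrated in few documents scores higher, which is the behaviour the paper is after.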

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
