lucene-java-user mailing list archives

From Michael McCandless
Subject Re: TermEnum.docFreq() includes deleted docs
Date Wed, 18 Jul 2012 21:26:59 GMT
On Tue, Jul 17, 2012 at 12:44 PM, Roman Chyla wrote:
> Hi,
> Tests show that TermEnum.docFreq() returns the count of all docs,
> including the deleted ones, which seems to (indirectly) contradict the javadoc

That's right: docFreq counts deleted documents too, and making it
exclude them would be prohibitively costly.

Hmm, which version/javadocs are you looking at?  IndexReader.docFreq,
at least, calls out this limitation.

> This frequency count is used to compute uninverted index
> (DocTermOrds.uninvert()). The code goes like:
>       final int df = te.docFreq();
>       if (df <= maxTermDocFreq) {
> So, if I happen to have many deleted documents, and maxTermDocFreq is
> low, then the term will be excluded (even if the freq of the livedocs
> is OK). Most likely, the cache will be incomplete.
> Can it be considered a feature? Or is it a bug?

Maybe we could pro-rate the returned docFreq by the percentage of
deleted documents?  It wouldn't be perfectly correct, but on average it
should have the right effect (keeping RAM consumption down).
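
A rough sketch of what that pro-rating could look like, as plain Java
with no Lucene dependency.  The class and method names here are
hypothetical, not an existing Lucene API; the inputs are assumed to come
from te.docFreq(), IndexReader.numDocs() (live docs) and
IndexReader.maxDoc() (live plus deleted).  It also assumes deletions are
spread evenly across terms, which is only true on average.

```java
public final class ProRatedDocFreq {
    private ProRatedDocFreq() {}

    /**
     * Estimates the live-document frequency of a term by scaling the raw
     * docFreq (which still counts deleted docs) by the live fraction of
     * the index.  Hypothetical helper, not part of Lucene.
     *
     * @param rawDocFreq value as returned by docFreq(), deletions included
     * @param numDocs    number of live documents in the index
     * @param maxDoc     number of live plus deleted documents
     */
    public static int estimate(int rawDocFreq, int numDocs, int maxDoc) {
        if (maxDoc <= 0) {
            return 0; // empty index: nothing to scale
        }
        // Compute in double to avoid int overflow on large indexes.
        return (int) Math.round((double) rawDocFreq * numDocs / maxDoc);
    }

    public static void main(String[] args) {
        int maxTermDocFreq = 60;   // illustrative threshold
        int raw = 100;             // raw docFreq, deleted docs included
        int df = estimate(raw, 500, 1000); // half of the index is deleted

        // Same comparison as in DocTermOrds.uninvert(), but on the estimate.
        System.out.println("raw=" + raw + " estimated=" + df
                + " kept=" + (df <= maxTermDocFreq));
    }
}
```

With a check like this, a term whose raw count exceeds maxTermDocFreq
only because of deleted documents would no longer be skipped, at the
cost of being wrong for terms whose deletions are far from average.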

Can you open a Jira issue?  Thanks.

Mike McCandless
