lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: TermEnum.docFreq() includes deleted docs
Date Wed, 18 Jul 2012 21:26:59 GMT
On Tue, Jul 17, 2012 at 12:44 PM, Roman Chyla <roman.chyla@gmail.com> wrote:
> Hi,
>
> Tests show that TermEnum.docFreq() returns sum of all docs, including
> the deleted ones. Which seems to (indirectly) contradict the javadoc

That's right; fixing it to reflect deleted documents would be
prohibitively costly.

Hmm, which version/javadocs are you looking at?  IndexReader.docFreq at
least calls out this limitation.

> This frequency count is used to compute uninverted index
> (DocTermOrds.uninvert()). The code goes like:
>
>       final int df = te.docFreq();
>       if (df <= maxTermDocFreq) {
>
>
> So, if I happen to have many deleted documents, and maxTermDocFreq is
> low, then the term will be excluded (even if the freq of the livedocs
> is OK). Most likely, the cache will be incomplete.
>
> Can it be considered a feature? Or is it a bug?

Maybe we could pro-rate the returned docFreq by the percentage of
deleted documents?  It wouldn't be perfectly correct, but on average it
should have the right effect (keeping RAM consumption down)?
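
To make the pro-rating idea concrete, here is a rough sketch, not a patch.
It assumes the live-document fraction comes from IndexReader.numDocs() and
IndexReader.maxDoc(), and that deletions are spread evenly across terms.
The class and method names (ProRatedDocFreq, proRate) are made up for
illustration and are not part of Lucene.

```java
// Hypothetical helper, not Lucene API: scales a raw docFreq (which still
// counts deleted docs) by the fraction of documents that are still live.
final class ProRatedDocFreq {

    private ProRatedDocFreq() {}

    /**
     * @param docFreq raw document frequency, deleted docs included
     * @param maxDoc  total doc slots in the index, e.g. IndexReader.maxDoc()
     * @param numDocs live docs in the index, e.g. IndexReader.numDocs()
     * @return estimated docFreq over live docs only (an approximation)
     */
    static int proRate(int docFreq, int maxDoc, int numDocs) {
        if (maxDoc <= 0) {
            // Empty index: nothing to scale by, return the raw value.
            return docFreq;
        }
        // Assumption: deletions hit every term roughly in proportion.
        return (int) Math.round((double) docFreq * numDocs / maxDoc);
    }

    public static void main(String[] args) {
        int maxTermDocFreq = 50;   // illustrative threshold
        int rawDf = 120;           // te.docFreq(), deleted docs included
        int estimate = proRate(rawDf, 1000, 400);  // 60% of docs deleted
        System.out.println("raw passes:       " + (rawDf <= maxTermDocFreq));
        System.out.println("pro-rated passes: " + (estimate <= maxTermDocFreq));
    }
}
```

In the uninvert check, the comparison would then use the estimate instead
of the raw te.docFreq().  A term whose deletions are far from average will
still be misjudged, so this only fixes the behavior on average.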

Can you open a Jira issue?  Thanks.

Mike McCandless

http://blog.mikemccandless.com
