From: Michael McCandless
Date: Wed, 18 Jul 2012 17:26:59 -0400
Subject: Re: TermEnum.docFreq() includes deleted docs
To: java-user@lucene.apache.org

On Tue, Jul 17, 2012 at 12:44 PM, Roman Chyla wrote:
> Hi,
>
> Tests show that TermEnum.docFreq() returns sum of all docs, including
> the deleted ones. Which seems to (indirectly) contradict the javadoc

That's right; fixing it to reflect deleted documents would be
prohibitively costly.

Hmm, which version/javadocs are you looking at? IndexReader.docFreq at
least calls out this limitation.

> This frequency count is used to compute uninverted index
> (DocTermOrds.uninvert()). The code goes like:
>
> final int df = te.docFreq();
> if (df <= maxTermDocFreq) {
>
>
> So, if I happen to have many deleted documents, and maxTermDocFreq is
> low, then the term will be excluded (even if the freq of the livedocs
> is OK). Most likely, the cache will be incomplete.
>
> Can it be considered a feature? Or is it a bug?

Maybe we could pro-rate the return docFreq by the pctg of deleted
documents? It wouldn't be perfectly correct but on average should
have the right effect (keeping RAM consumption down)?

Can you open a Jira issue? Thanks.

Mike McCandless

http://blog.mikemccandless.com
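
For illustration only, here is a rough sketch of what pro-rating docFreq by the fraction of live documents might look like. It is plain Java with no Lucene dependency; the class and method names (ProRatedDocFreq, estimateLiveDocFreq) are hypothetical, and the maxDoc/numDocs arguments are assumed to come from IndexReader.maxDoc() and IndexReader.numDocs(). It assumes deletions are spread evenly across terms, so the result is an estimate, not an exact count, and it is not what DocTermOrds currently does.

```java
final class ProRatedDocFreq {
    private ProRatedDocFreq() {}

    /**
     * Estimates how many live (non-deleted) documents contain a term.
     *
     * @param docFreq raw document frequency, which still counts deleted docs
     * @param maxDoc  total number of doc slots, including deleted docs
     * @param numDocs number of live docs
     */
    public static int estimateLiveDocFreq(int docFreq, int maxDoc, int numDocs) {
        if (maxDoc <= 0 || docFreq <= 0) {
            return 0; // empty index or unseen term: nothing to scale
        }
        // Fraction of the index that is still live, e.g. 0.5 if half is deleted.
        double liveFraction = (double) numDocs / maxDoc;
        return (int) Math.round(docFreq * liveFraction);
    }

    public static void main(String[] args) {
        int maxTermDocFreq = 100; // hypothetical threshold, as in the quoted check
        int rawDocFreq = 150;     // raw count, deleted docs included
        int estimated = estimateLiveDocFreq(rawDocFreq, 1000, 500);

        // The raw count would exclude the term; the pro-rated one keeps it.
        System.out.println("raw passes:      " + (rawDocFreq <= maxTermDocFreq));
        System.out.println("estimated passes: " + (estimated <= maxTermDocFreq));
    }
}
```

With half the index deleted, a raw docFreq of 150 is scaled down to 75, so the term would pass a maxTermDocFreq of 100 where the raw count would not.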