Date: Mon, 22 Oct 2012 14:12:15 +0000 (UTC)
From: "Robert Muir (JIRA)"
To: dev@lucene.apache.org
Message-ID: <2063226515.9844.1350915135133.JavaMail.jiratomcat@arcas>
Subject: [jira] [Created] (LUCENE-4498) pulse docfreq=1 DOCS_ONLY for 4.1 codec

Robert Muir created LUCENE-4498:
-----------------------------------

             Summary: pulse docfreq=1 DOCS_ONLY for 4.1 codec
                 Key: LUCENE-4498
                 URL: https://issues.apache.org/jira/browse/LUCENE-4498
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/codecs
            Reporter: Robert Muir

We have the pulsing codec, but currently this has some downsides:

* it's very general, wrapping an arbitrary postingsformat and pulsing everything in the postings for an arbitrary docfreq/totalTermFreq cutoff
* reuse is hairy: because it specializes its enums based on these cutoffs, when walking through terms, e.g.
  merging, there is a lot of sophisticated stuff to avoid the worst cases where we clone indexinputs for tons of terms.

On the other hand, the way the 4.1 codec encodes "primary key" fields is pretty silly: we write the docStartFP vlong in the term dictionary metadata, which tells us where to seek in the .doc file to read our one lonely vint.

I think it's worth investigating whether, in the DOCS_ONLY docfreq=1 case, we can just write the lone doc delta where we would write docStartFP. We can avoid the hairy reuse problem too, by just supporting this in refillDocs() in BlockDocsEnum instead of specializing.

This would remove the additional seek for "primary key" fields without really any of the downsides of pulsing today.
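
To make the idea concrete, here is a rough, self-contained sketch of the proposed term-metadata layout for a DOCS_ONLY field. It is not the 4.1 codec's actual code: the class SingletonDocSketch, the TermMeta holder, and the encodeTermMeta/decodeTermMeta helpers are hypothetical names invented for illustration, and the vlong routines are a simplified stand-in for the codec's own I/O. It also assumes that the lone doc delta is taken relative to 0, so it equals the doc ID.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical illustration only; not the real 4.1 codec writer or reader. */
final class SingletonDocSketch {

    /** Decoded term metadata; a field that does not apply is set to -1. */
    static final class TermMeta {
        final long docFreq;
        final long docStartFP;     // file pointer into the .doc file, or -1
        final long singletonDocID; // the lone doc when docFreq == 1, or -1

        TermMeta(long docFreq, long docStartFP, long singletonDocID) {
            this.docFreq = docFreq;
            this.docStartFP = docStartFP;
            this.singletonDocID = singletonDocID;
        }
    }

    /** Writes a non-negative long, 7 bits per byte, low-order bits first. */
    static void writeVLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80)); // high bit set: more bytes follow
            v >>>= 7;
        }
        out.write((int) v);
    }

    /** Reads a vlong starting at pos[0] and advances pos[0] past it. */
    static long readVLong(byte[] buf, int[] pos) {
        long v = 0;
        int shift = 0;
        byte b;
        do {
            b = buf[pos[0]++];
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return v;
    }

    /**
     * Encodes metadata for one term of a DOCS_ONLY field. The slot that would
     * normally carry docStartFP carries the doc itself when docFreq == 1.
     */
    static byte[] encodeTermMeta(int docFreq, long docStartFP, int singletonDocID) {
        ByteArrayOutputStream meta = new ByteArrayOutputStream();
        writeVLong(meta, docFreq);
        writeVLong(meta, docFreq == 1 ? singletonDocID : docStartFP);
        return meta.toByteArray();
    }

    /** Decodes what encodeTermMeta wrote; docFreq == 1 means no .doc seek is needed. */
    static TermMeta decodeTermMeta(byte[] meta) {
        int[] pos = {0};
        long docFreq = readVLong(meta, pos);
        long slot = readVLong(meta, pos);
        return docFreq == 1
            ? new TermMeta(docFreq, -1, slot)
            : new TermMeta(docFreq, slot, -1);
    }

    public static void main(String[] args) {
        TermMeta pk = decodeTermMeta(encodeTermMeta(1, 123456L, 42));
        System.out.println("docFreq=" + pk.docFreq + " doc=" + pk.singletonDocID);
    }
}
```

In this sketch, a reader that sees docFreq == 1 already has the document in hand from the term dictionary and never touches the .doc file, which is the seek the proposal aims to remove. Where exactly that branch would live (for example inside refillDocs()) is left open here.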