lucene-dev mailing list archives

From "Robert Muir (JIRA)"
Subject [jira] [Created] (LUCENE-4498) pulse docfreq=1 DOCS_ONLY for 4.1 codec
Date Mon, 22 Oct 2012 14:12:15 GMT
Robert Muir created LUCENE-4498:

             Summary: pulse docfreq=1 DOCS_ONLY for 4.1 codec
                 Key: LUCENE-4498
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/codecs
            Reporter: Robert Muir

We have the pulsing codec, but currently it has some downsides:
* it's very general, wrapping an arbitrary PostingsFormat and pulsing everything in the postings
for an arbitrary docfreq/totalTermFreq cutoff
* reuse is hairy: because it specializes its enums based on these cutoffs, when walking through
terms (e.g. when merging) there is a lot of sophisticated stuff to avoid the worst cases where we
clone IndexInputs for tons of terms.

On the other hand, the way the 4.1 codec encodes "primary key" fields is pretty silly: we write
the docStartFP vlong in the term dictionary metadata, which tells us where to seek in the
.doc to read our one lonely vint.

I think it's worth investigating whether, in the DOCS_ONLY docfreq=1 case, we can just write the
lone doc delta where we would write docStartFP.
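
To make the idea concrete, here is a rough sketch of what the term metadata slot could look like. It is not the actual 4.1 codec code: the class, the method names, the TermState holder and the hand-rolled vlong helpers are all hypothetical stand-ins, and it assumes docfreq and the index options are already known from elsewhere in the term dictionary. For a single-document term the "doc delta" is just the doc ID itself, since the delta is taken from zero.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not Lucene code: one vlong slot in the term metadata
// holds either docStartFP or, for DOCS_ONLY terms with docFreq == 1, the doc itself.
class SingletonTermMetadataSketch {

    /** Decoded view of the slot; -1 marks the field that is not in use. */
    static final class TermState {
        long docStartFP = -1;
        int singletonDocID = -1;
    }

    // Stand-in for a vlong writer: 7 payload bits per byte, high bit means "more bytes follow".
    static void writeVLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // Reads one vlong starting at offset 0 of buf.
    static long readVLong(byte[] buf) {
        long result = 0;
        int shift = 0;
        int pos = 0;
        while (true) {
            byte b = buf[pos++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    /** Writes the lone doc where docStartFP would normally go, if the term qualifies. */
    static void encodeTerm(ByteArrayOutputStream out, boolean docsOnly, int docFreq,
                           long docStartFP, int singletonDocID) {
        if (docsOnly && docFreq == 1) {
            writeVLong(out, singletonDocID); // nothing is written to .doc for this term
        } else {
            writeVLong(out, docStartFP);
        }
    }

    /** Mirrors encodeTerm: the same condition decides how to interpret the slot. */
    static TermState decodeTerm(byte[] metadata, boolean docsOnly, int docFreq) {
        TermState state = new TermState();
        long value = readVLong(metadata);
        if (docsOnly && docFreq == 1) {
            state.singletonDocID = (int) value;
        } else {
            state.docStartFP = value;
        }
        return state;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        encodeTerm(out, true, 1, 0L, 300);
        TermState state = decodeTerm(out.toByteArray(), true, 1);
        System.out.println("singletonDocID=" + state.singletonDocID
            + " docStartFP=" + state.docStartFP);
    }
}
```

The point of the sketch is that the reader and writer agree on the interpretation purely from docfreq and the index options, so no extra flag byte is needed in the metadata.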

We can avoid the hairy reuse problem too, by just supporting this in refillDocs() in BlockDocsEnum
instead of specializing.
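
As a rough illustration of handling this in the refill path rather than in a specialized enum, the sketch below uses a single enum class for both cases. It is a simplified model, not the real BlockDocsEnum: the .doc file is faked with an int array of doc deltas, and the field names, the reset() signature and the seeks counter are assumptions made for the example.

```java
// Hypothetical sketch, not Lucene code: one enum class serves both normal terms
// and inlined single-doc terms; only refillDocs() knows the difference.
class RefillDocsSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    static final int BLOCK_SIZE = 128;

    private final int[] docDeltaBuffer = new int[BLOCK_SIZE];
    private int[] docFile;      // stand-in for the .doc file: plain doc deltas
    private int docFilePos;     // stand-in for the file pointer
    int seeks;                  // counts simulated seeks into the .doc file

    private int docFreq;
    private long docStartFP;
    private int singletonDocID; // -1 when the term is not inlined
    private int docUpto;        // docs returned so far
    private int bufferUpto;     // next unread slot in docDeltaBuffer
    private int bufferSize;     // valid entries in docDeltaBuffer
    private int accum;          // running sum of deltas = current doc ID

    /** Re-targets this enum at another term; no cloning or specialization needed. */
    void reset(int[] docFile, int docFreq, long docStartFP, int singletonDocID) {
        this.docFile = docFile;
        this.docFreq = docFreq;
        this.docStartFP = docStartFP;
        this.singletonDocID = singletonDocID;
        docUpto = 0;
        bufferUpto = 0;
        bufferSize = 0;
        accum = 0;
    }

    private void refillDocs() {
        if (singletonDocID != -1) {
            // Inlined term: the lone doc came from the term metadata, so no seek or read.
            docDeltaBuffer[0] = singletonDocID;
            bufferSize = 1;
        } else {
            if (docUpto == 0) {
                docFilePos = (int) docStartFP; // lazy seek on first refill
                seeks++;
            }
            bufferSize = Math.min(docFreq - docUpto, BLOCK_SIZE);
            System.arraycopy(docFile, docFilePos, docDeltaBuffer, 0, bufferSize);
            docFilePos += bufferSize;
        }
        bufferUpto = 0;
    }

    int nextDoc() {
        if (docUpto == docFreq) {
            return NO_MORE_DOCS;
        }
        if (bufferUpto == bufferSize) {
            refillDocs();
        }
        accum += docDeltaBuffer[bufferUpto++];
        docUpto++;
        return accum;
    }
}
```

In this model the same enum instance can be reset from a multi-doc term to a "primary key" term and back, and the inlined term never touches the simulated .doc file.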

This would remove the additional seek for "primary key" fields without really any of the downsides
of pulsing today.
