lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Created: (LUCENE-2654) bulk-code each chunk b/w indexed terms in the terms dict
Date Sun, 19 Sep 2010 15:09:32 GMT
bulk-code each chunk b/w indexed terms in the terms dict

                 Key: LUCENE-2654
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: 4.0
            Reporter: Michael McCandless
            Priority: Minor

This is an idea for exploration that came up w/ Robert...

In PrefixCodedTermsDict (used by the default Standard codec), we encode each term entry "standalone",
using vInts.  We store the changed suffix (start, end, bytes), then metadata for the term
like docFreq, frq start, prx start, skip start.  Each of these ints is a vInt, which is relatively
costly to decode.
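
For readers unfamiliar with the format, here is a minimal Java sketch of vInt coding (7 bits per
byte, low-order bits first, high bit set when another byte follows).  VIntSketch and its method
names are hypothetical and only for illustration; they are not the Lucene classes.  Note that
decoding has to test the continuation bit on every byte, for every int of every term.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: VIntSketch is a hypothetical name, not a Lucene class.
final class VIntSketch {
    // Writes a non-negative int, 7 bits per byte, low-order bits first.
    // The high bit of each byte signals that another byte follows.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Reads one vInt starting at pos[0] and advances pos[0] past it.
    // One data-dependent branch per byte is needed to find the end.
    static int readVInt(byte[] buf, int[] pos) {
        byte b = buf[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }
}
```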

If instead we store the N terms between indexed terms "column-stride", using a bulk codec like
FOR/PFOR, so that (for N = 32) the 32 docFreqs are stored as one block, 32 frq deltas as another, etc.,
then seek and next should be faster.  I.e., we could make decode of the metadata lazy, so that
a seek to a term that does not exist may be able to avoid any metadata decode entirely.  Sequential
scanning (lots of .next in a row) would also be faster, even if it needs the metadata, since
bulk-decode should be faster than multiple vInt decodes.
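
A rough Java sketch of the lazy, column-stride idea, under loud assumptions: TermBlockSketch is a
hypothetical class (not part of Lucene), plain fixed-width bit packing stands in for a real
FOR/PFOR codec, only the docFreq column is shown, and terms are kept as plain sorted Strings
with a binary search instead of prefix-coded suffixes.

```java
import java.util.Arrays;

// Hypothetical sketch, not a Lucene class. Fixed-width bit packing is a
// stand-in for FOR/PFOR; values are assumed to be non-negative.
final class TermBlockSketch {
    private final String[] terms;        // sorted terms of one block (suffix coding omitted)
    private final long[] packedDocFreqs; // docFreq column, bitsPerValue bits per entry
    private final int bitsPerValue;
    private int[] docFreqs;              // decoded lazily; null until first needed
    int decodeCount = 0;                 // number of bulk decodes performed so far

    TermBlockSketch(String[] sortedTerms, int[] docFreqValues) {
        terms = sortedTerms.clone();
        int max = 1;
        for (int v : docFreqValues) max = Math.max(max, v);
        bitsPerValue = 32 - Integer.numberOfLeadingZeros(max);
        packedDocFreqs = new long[(docFreqValues.length * bitsPerValue + 63) / 64];
        for (int i = 0; i < docFreqValues.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            packedDocFreqs[word] |= ((long) docFreqValues[i]) << shift;
            if (shift + bitsPerValue > 64) { // value straddles two longs
                packedDocFreqs[word + 1] |= ((long) docFreqValues[i]) >>> (64 - shift);
            }
        }
    }

    // Decodes the whole docFreq column in one tight loop with no per-value branching on length.
    private void decodeColumn() {
        int[] out = new int[terms.length];
        long mask = (1L << bitsPerValue) - 1;
        for (int i = 0; i < out.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packedDocFreqs[word] >>> shift;
            if (shift + bitsPerValue > 64) {
                v |= packedDocFreqs[word + 1] << (64 - shift);
            }
            out[i] = (int) (v & mask);
        }
        docFreqs = out;
        decodeCount++;
    }

    // Returns the docFreq of the term, or -1 if the block does not contain it.
    int seekDocFreq(String term) {
        int idx = Arrays.binarySearch(terms, term);
        if (idx < 0) return -1;              // miss: the metadata column is never touched
        if (docFreqs == null) decodeColumn(); // hit: decode the column once, then reuse it
        return docFreqs[idx];
    }
}
```

In this sketch a seek to a missing term returns without decoding any metadata, and repeated
lookups within the same block share a single bulk decode.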

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.