lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] [Commented] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding
Date Fri, 16 Aug 2013 20:07:47 GMT


Michael McCandless commented on LUCENE-5179:

So, the idea with this patch is to go back to letting the PBF encode
the metadata for the term?  Just, one term at a time, not the whole
block that we have on trunk today.

And the reason for this is back-compat?  I.e., so that in test-framework
we can have writers for the old formats?

One thing that this change precludes is having the terms dict use
different encodings than simple delta vInt to encode the long[]
metadata, e.g. Simple9/16 or something?  But that's OK ... we can
explore those later.
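
For reference, "simple delta vInt" over the long[] metadata can be sketched roughly as below. This is an illustrative stand-alone sketch, not Lucene's actual code: the class and method names (DeltaVLongSketch, encodeBlock, decodeBlock) are hypothetical, and it assumes every long column (e.g. a file pointer) is non-decreasing from one term to the next within a block.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: delta + variable-length encoding of per-term long[] metadata.
final class DeltaVLongSketch {

    // Writes a non-negative long using 7 payload bits per byte;
    // the high bit is set on every byte except the last one.
    static void writeVLong(ByteArrayOutputStream out, long v) {
        if (v < 0) {
            throw new IllegalArgumentException("negative value: " + v);
        }
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // Encodes one block: each term's longs are written as deltas against
    // the previous term's longs (the first term is a delta against zeros).
    static byte[] encodeBlock(long[][] metadata) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long[] prev = new long[metadata.length == 0 ? 0 : metadata[0].length];
        for (long[] cur : metadata) {
            for (int i = 0; i < cur.length; i++) {
                writeVLong(out, cur[i] - prev[i]);
                prev[i] = cur[i];
            }
        }
        return out.toByteArray();
    }

    // Reverses encodeBlock, given the number of terms and longs per term.
    static long[][] decodeBlock(byte[] in, int termCount, int longsSize) {
        long[][] result = new long[termCount][longsSize];
        long[] prev = new long[longsSize];
        int pos = 0;
        for (int t = 0; t < termCount; t++) {
            for (int i = 0; i < longsSize; i++) {
                long v = 0;
                int shift = 0;
                byte b;
                do {
                    b = in[pos++];
                    v |= (long) (b & 0x7F) << shift;
                    shift += 7;
                } while ((b & 0x80) != 0);
                prev[i] += v;
                result[t][i] = prev[i];
            }
        }
        return result;
    }
}
```

With three terms whose metadata is {100, 7}, {130, 7}, {400, 9}, encodeBlock writes the deltas 100, 7, 30, 0, 270, 2 in 7 bytes. A Simple9/16-style encoder would need to see the whole block of deltas at once, which is the flexibility a per-term API gives up.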

It's sort of frustrating to have to compromise the design just for
back-compat ... e.g. we could instead cheat a bit, and have the
writers write the newer format.  It's easy to make the readers read
either format right?

But ... I don't understand how this change helps Pulsing, or rather
why Pulsing would have trouble w/ the API we have today?

> Refactoring on PostingsWriterBase for delta-encoding
> ----------------------------------------------------
>                 Key: LUCENE-5179
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Han Jiang
>            Assignee: Han Jiang
>             Fix For: 5.0, 4.5
>         Attachments: LUCENE-5179.patch
> A further step from LUCENE-5029.
> In short, the previous API change introduced two problems:
> * it partly breaks backward compatibility: we can still read the old format,
>   but we can no longer reproduce it;
> * the pulsing codec has trouble with it.
> And the long story...
> With the change, the current PostingsBase API works like this:
> * the term dict tells the PBF that a new term starts (via startTerm());
> * the PBF adds docs, positions and other postings data;
> * the term dict tells the PBF that all the data for the current term is complete (via finishTerm()),
>   and the PBF returns the metadata for that term (as long[] and byte[]);
> * the term dict may buffer all the metadata in an ArrayList; once all the terms are collected,
>   it decides how that metadata will be laid out on disk.
> So after the API change, the PBF no longer has that annoying 'flushTermBlock'; instead, the
> term dict maintains the <term, metadata> list.
> However, for each term we now write the long[] blob before the byte[], so the index format
> is not consistent with pre-4.5.
> For example, in Lucene41 the metadata can be written as longA,bytesA,longB, but now we have to
> write it as longA,longB,bytesA.
> Another problem is that the pulsing codec cannot tell the wrapped PBF how the metadata is
> delta-encoded; after all, PulsingPostingsWriter is only a PBF.
> For example, say we have terms=["a", "a1", "a2", "b", "b1", "b2"] and itemsInBlock=2, so theoretically
> we'll end up with three blocks in BTTR: ["a" "b"]  ["a1" "a2"]  ["b1" "b2"]. With this
> approach, the metadata of term "b" is delta-encoded based on the metadata of "a", but when
> the term dict tells the PBF to finishTerm("b"), it might naively delta-encode based on term "a2".
> So I think maybe we can introduce a method 'encodeTerm(long[], DataOutput out, FieldInfo,
> TermState, boolean absolute)', so that during the metadata flush we can control how the
> current term is written? The term dict will then buffer TermState, which implicitly holds
> the metadata, like we do on the PBReader side.
> For example, if we want to reproduce the old Lucene41 format, we can simply set longsSize==0;
> the PBF then writes the old format (longA,bytesA,longB) to the DataOutput, and the compatibility issue is
> For the pulsing codec, it will also be able to tell the lower level how to encode metadata.
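
To make the quoted block example concrete, here is a rough, simplified sketch of how an 'absolute' flag lets the term dict choose the delta base when it flushes a block. It is not the patch itself: TermMeta, Writer and flushBlock are hypothetical names, the proposed encodeTerm signature is reduced to three arguments, and a single made-up file pointer stands in for the real TermState.

```java
import java.util.List;

// Hypothetical sketch of block-aware delta encoding with an 'absolute' flag.
final class EncodeTermSketch {

    // Stand-in for a buffered TermState: a term and one file pointer.
    static final class TermMeta {
        final String term;
        final long docStartFP;

        TermMeta(String term, long docStartFP) {
            this.term = term;
            this.docStartFP = docStartFP;
        }
    }

    // Stand-in for the postings writer: remembers the last encoded term
    // and uses it as the delta base unless told to start over.
    static final class Writer {
        private long lastDocStartFP;

        // Fills longs[0] with the file pointer, either absolute or as a
        // delta against the previously encoded term.
        void encodeTerm(long[] longs, TermMeta state, boolean absolute) {
            if (absolute) {
                lastDocStartFP = 0;
            }
            longs[0] = state.docStartFP - lastDocStartFP;
            lastDocStartFP = state.docStartFP;
        }
    }

    // Term dict side: the first term of each block is written absolute,
    // the rest are deltas against their predecessor in the same block.
    static long[] flushBlock(Writer writer, List<TermMeta> block) {
        long[] encoded = new long[block.size()];
        long[] longs = new long[1];
        for (int i = 0; i < block.size(); i++) {
            writer.encodeTerm(longs, block.get(i), i == 0);
            encoded[i] = longs[0];
        }
        return encoded;
    }
}
```

Assuming file pointers a=10, a1=20, a2=30, b=40, flushing the block ["a1" "a2"] and then the block ["a" "b"] yields 20, 10 and 10, 30, so "b" is encoded against "a". Encoding "b" immediately after "a2" with absolute=false would instead yield 10, which is the wrong delta base described in the issue.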
