lucene-dev mailing list archives

From "Han Jiang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding
Date Fri, 16 Aug 2013 17:08:48 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-5179:
------------------------------

    Attachment: LUCENE-5179.patch

Patch for branch3069; tests pass for all 'temp' postings formats.
                
> Refactoring on PostingsWriterBase for delta-encoding
> ----------------------------------------------------
>
>                 Key: LUCENE-5179
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5179
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Han Jiang
>            Assignee: Han Jiang
>             Fix For: 5.0, 4.5
>
>         Attachments: LUCENE-5179.patch
>
>
> A further step from LUCENE-5029.
> In short, the previous API change brings two problems:
> * it somewhat breaks backward compatibility: although we can still read the old format,
>   we can no longer reproduce it;
> * the pulsing codec has problems with it.
> And the long story...
> With the change, the current PostingsBase API works like this:
> * the term dict tells the PBF that a new term is starting (via startTerm());
> * the PBF adds docs, positions and other postings data;
> * the term dict tells the PBF that all data for the current term is complete (via finishTerm()),
>   and the PBF returns the metadata for the current term (as long[] and byte[]);
> * the term dict might buffer all the metadata in an ArrayList. Once all terms are collected,
>   it then decides how that metadata will be laid out on disk.
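The call sequence in the list above can be sketched as follows. This is a minimal, self-contained illustration rather than the Lucene code: `TermMetadata`, `SketchPostingsWriter`, `SketchTermDict`, and the fixed 4-bytes-per-doc cost are hypothetical stand-ins, and only the method names startTerm() and finishTerm() come from the issue text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the per-term metadata returned by finishTerm().
final class TermMetadata {
    final long[] longs; // e.g. file pointers, suitable for delta-encoding
    final byte[] bytes; // opaque blob owned by the postings writer

    TermMetadata(long[] longs, byte[] bytes) {
        this.longs = longs;
        this.bytes = bytes;
    }
}

// Hypothetical, heavily simplified postings writer (the "PBF" side).
final class SketchPostingsWriter {
    private long filePointer; // where the next posting would be written
    private long termStartFP; // file pointer at the start of the current term
    private int docCount;     // docs added for the current term

    void startTerm() {
        termStartFP = filePointer;
        docCount = 0;
    }

    void addDoc(int docId) {
        // Assumption for illustration: every doc occupies 4 bytes.
        filePointer += 4;
        docCount++;
    }

    TermMetadata finishTerm() {
        return new TermMetadata(new long[] {termStartFP}, new byte[] {(byte) docCount});
    }
}

// Hypothetical term dictionary: drives the writer and buffers the metadata.
final class SketchTermDict {
    private final SketchPostingsWriter writer = new SketchPostingsWriter();
    private final List<TermMetadata> pending = new ArrayList<>();

    void addTerm(int[] docIds) {
        writer.startTerm();
        for (int docId : docIds) {
            writer.addDoc(docId);
        }
        // The writer hands back metadata; the term dict decides the layout later.
        pending.add(writer.finishTerm());
    }

    List<TermMetadata> pending() {
        return pending;
    }
}
```

In this sketch the writer never decides where metadata lands on disk; it only returns it, which is the division of labor the list describes.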
> So after the API change, the PBF no longer has that annoying 'flushTermBlock'; instead the
> term dict maintains the <term, metadata> list.
> However, for each term we now write the long[] blob before the byte[], so the index format
> is not consistent with pre-4.5.
> For example, in Lucene41 the metadata can be written as longA,bytesA,longB, but now we have to
> write it as longA,longB,bytesA.
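To make the incompatibility concrete, here is a small sketch that serializes the same three pieces of metadata in both orders. It assumes fixed-width longs written through `java.io.DataOutputStream` purely for illustration; the class and method names are hypothetical, and the real on-disk encoding is not shown here.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class MetadataLayoutSketch {
    // Order described for the old format: longA, bytesA, longB.
    static byte[] interleaved(long longA, byte[] bytesA, long longB) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeLong(longA);
            out.write(bytesA);
            out.writeLong(longB);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Order forced by the new API: all longs first, then the byte blob.
    static byte[] longsFirst(long longA, long longB, byte[] bytesA) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeLong(longA);
            out.writeLong(longB);
            out.write(bytesA);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }
}
```

Both methods emit the same number of bytes, but the blob sits at a different offset, so a reader written for one order cannot parse the other.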
> Another problem is that the pulsing codec cannot tell the wrapped PBF how the metadata is
> delta-encoded; after all, PulsingPostingsWriter is only a PBF.
> For example, say we have terms=["a", "a1", "a2", "b", "b1", "b2"] and itemsInBlock=2, so theoretically
> we'll end up with three blocks in BTTR: ["a" "b"]  ["a1" "a2"]  ["b1" "b2"]. With this
> approach, the metadata of term "b" is delta-encoded based on the metadata of "a". But when the
> term dict tells the PBF to finishTerm("b"), it might naively do the delta encoding based on term "a2".
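The six-term example above can be worked through with numbers. The sketch below assigns each term an invented file pointer (10, 20, ..., 60, in arrival order) and compares deltas taken against the previous term in arrival order with deltas taken against the previous term in the same block. `DeltaBaseSketch` and the pointer values are assumptions made for illustration only.

```java
final class DeltaBaseSketch {
    // Delta-encode: the first value is absolute, each later value is
    // stored as the difference from the value immediately before it.
    static long[] encode(long[] values) {
        long[] deltas = new long[values.length];
        long prev = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - prev;
            prev = values[i];
        }
        return deltas;
    }

    // Inverse of encode(): running sum over the deltas.
    static long[] decode(long[] deltas) {
        long[] values = new long[deltas.length];
        long prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            values[i] = prev + deltas[i];
            prev = values[i];
        }
        return values;
    }
}
```

With pointers {10, 20, 30, 40, 50, 60} for a, a1, a2, b, b1, b2, arrival-order encoding stores "b" as 40 - 30 = 10 (relative to "a2"). A reader of block ["a" "b"] then reconstructs 10 + 10 = 20 for "b" instead of 40. Encoding the block {10, 40} on its own stores 30 for "b" and decodes correctly.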
> So I think maybe we can introduce a method 'encodeTerm(long[], DataOutput out, FieldInfo,
> TermState, boolean absolute)', so that during metadata flush we can control how the current
> term is written? The term dict would then buffer TermState, which implicitly holds the
> metadata, as we do on the PBReader side.
> For example, if we want to reproduce the old Lucene41 format, we can simply set longsSize==0;
> the PBF then writes the old format (longA,bytesA,longB) to the DataOutput, and the
> compatibility issue is solved.
> The pulsing codec will also be able to tell the lower level how to encode metadata.
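One way the proposed 'absolute' flag might behave is sketched below. This is an assumption-laden simplification, not the patch: it drops the long[] and FieldInfo parameters, uses `java.io.DataOutputStream` in place of Lucene's DataOutput, and the classes `SketchTermState` and `EncodeTermSketch` are hypothetical. It does not model the longsSize==0 case.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical stand-in for a buffered TermState; holds one file pointer.
final class SketchTermState {
    final long docStartFP;

    SketchTermState(long docStartFP) {
        this.docStartFP = docStartFP;
    }
}

final class EncodeTermSketch {
    private long lastDocStartFP; // base for the next delta

    // Simplified analogue of the proposed encodeTerm(..., boolean absolute).
    void encodeTerm(DataOutputStream out, SketchTermState state, boolean absolute)
            throws IOException {
        if (absolute) {
            lastDocStartFP = 0; // forget the previous base and write the full value
        }
        out.writeLong(state.docStartFP - lastDocStartFP);
        lastDocStartFP = state.docStartFP;
    }

    // Term-dict side: flush one block, marking only its first term absolute.
    byte[] flushBlock(List<SketchTermState> block) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            boolean absolute = true;
            for (SketchTermState state : block) {
                encodeTerm(out, state, absolute);
                absolute = false;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }
}
```

Because the caller chooses when a term is absolute, the delta base is always the previous term in the block being flushed, regardless of the order in which terms originally arrived.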

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

