lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Could positions/payloads in SegmentMerger be copied directly?
Date Mon, 22 Sep 2008 17:24:38 GMT

OK, on closer inspection, I don't think this optimization will work,
unless I'm missing something... But it was a good idea, so keep 'em
coming!

The TermInfo stores only one proxPointer per term, not one per document
in the postings.  This means the optimization could only apply if
there are no deleted docs in the postings for that term, and the input
and output formats are congruent.  Then we would move writing to
proxOutput out of the while loop in appendPostings and do a bulk copy
of all bytes in the proxStream for that one term & segment.
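
The conditions and the bulk copy described above can be sketched
roughly as follows.  This is only an illustration: the byte array
standing in for proxStream, the ByteArrayOutputStream standing in for
proxOutput, and the names ProxBulkCopySketch, canBulkCopy and
copyTermProx are all hypothetical, not Lucene APIs.  It assumes the
term's prox bytes run from its own proxPointer up to the next term's
proxPointer.

```java
import java.io.ByteArrayOutputStream;

class ProxBulkCopySketch {
    /**
     * Bulk copy is only safe when no docs are deleted for the term and the
     * input and output segments agree on whether payloads are stored.
     */
    static boolean canBulkCopy(boolean hasDeletedDocs,
                               boolean inputStoresPayloads,
                               boolean outputStoresPayloads) {
        return !hasDeletedDocs && inputStoresPayloads == outputStoresPayloads;
    }

    /**
     * Copies the bytes in [proxPointer, nextProxPointer) from the stand-in
     * prox stream to the stand-in prox output and returns the byte count.
     */
    static int copyTermProx(byte[] proxStream,
                            long proxPointer,
                            long nextProxPointer,
                            ByteArrayOutputStream proxOutput) {
        int length = (int) (nextProxPointer - proxPointer);
        proxOutput.write(proxStream, (int) proxPointer, length);
        return length;
    }
}
```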

But, there's a problem with that: we can't compute the skip pointer as
we write.  The DefaultSkipListWriter looks at the proxOutput pointer
every skipInterval docs written and records the offset.  If we
bulk-copy the prox bytes at the end we have no idea what the offset is
every skipInterval docs.
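
To make the skip pointer problem concrete, here is a minimal sketch,
assuming a simplified model in which the writer samples the prox
output offset at every skipInterval-th doc, just before that doc's
positions are written.  The class and method names and the per-doc
byte counts are invented for illustration; this is not the
DefaultSkipListWriter code.

```java
import java.util.ArrayList;
import java.util.List;

class SkipOffsetSketch {
    /**
     * Simulates writing positions doc by doc and returns the prox output
     * offsets sampled at every skipInterval-th doc.
     */
    static List<Long> skipOffsets(int[] proxBytesPerDoc, int skipInterval) {
        List<Long> offsets = new ArrayList<>();
        long proxPointer = 0; // stand-in for the proxOutput file pointer
        int df = 0;           // number of docs seen so far for this term
        for (int bytes : proxBytesPerDoc) {
            df++;
            if (df % skipInterval == 0) {
                // Sample the offset before this doc's positions are written.
                offsets.add(proxPointer);
            }
            proxPointer += bytes; // "write" this doc's position bytes
        }
        return offsets;
    }
}
```

With per-doc byte counts {3, 1, 4, 1, 5, 9} and skipInterval 2, the
sampled offsets are 3, 8 and 14.  A bulk copy would only see the
23-byte total, so those intermediate offsets could not be recovered
without decoding the positions anyway.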

Mike

Paul Elschot wrote:

> On Friday 19 September 2008 17:05:29, Michael McCandless wrote:
>> Not quite, because how positions are encoded depends on whether any
>> payload appeared in that segment.
>>
>> However, if 1) the input is a SegmentReader (since in general we can
>> merge any IndexReader), and 2) its format is "congruent" with the
>> format we are writing (ie both don't or do use the payloads format),
>> which ought to be true the vast majority of the time, then I think we
>> could simply copy bytes.  Since the next TermInfo tells us the
>> proxPointer where it begins, we know exactly how many bytes to copy.
>> I think this'd be a nice optimization!
>
> I tried to find a way to do this, but I'm stuck at the point where
> the proxPointer is needed from a TermInfo.
> I got this far (uncompiled code, smi is the SegmentMergeInfo
> that is currently merged):
>
>    if (smi.reader instanceof SegmentReader) {
>      // smi.reader is declared as IndexReader, so cast it:
>      SegmentReader inputReader = (SegmentReader) smi.reader;
>      boolean readerStorePayloads =
>          inputReader.fieldInfos.fieldInfo(smi.term.field).storePayloads;
>      if (storePayloads == readerStorePayloads) {
>        // take the difference of the two prox pointers:
>        int positionsLength = inputReader.tis. ... -  ...;
>        // do a direct byte copy from inputReader to proxOutput:
>        ... ;
>      }
>    }
>
> but I could not find out how to get from the TermInfosReader
> at inputReader.tis to the next prox pointer.
>
> SegmentMerger never needs to index the positions by using a
> proxPointer itself, as it accesses all positions serially. This leaves
> me without an example on how to use proxPointer from a TermInfo.
>
> Any tips on how to continue?
>
> Regards,
> Paul Elschot
>
>
>> Mike
>>
>> Paul Elschot wrote:
>>> I'm looking at the for loop in SegmentMerger.java at line 666,
>>> which completely interprets the input positions/payloads for
>>> an input term at a document.
>>>
>>> The positions/payloads don't change when they are merged, is that
>>> correct? I'm wondering whether this loop could be replaced by a
>>> direct copy from the input postings to proxOutput.
>>>
>>> Regards,
>>> Paul Elschot
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org



