lucene-dev mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Could positions/payloads in SegmentMerger be copied directly?
Date Mon, 22 Sep 2008 18:38:02 GMT
Mike,

I had another look at SegmentTermDocs.skipTo() and at
SegmentTermPositions, and I think I'm beginning to get
your point.

Could the bulk copy be done per block of skipInterval docs instead,
so that the skip pointer can still be recorded at each block boundary?
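
To make the question concrete, here is a minimal sketch in plain Java
(not Lucene code, and not a patch). It assumes that the term has no
deleted docs, that the input and output formats are congruent, that both
sides use the same skipInterval, and that the input skip data already
gives the prox offset at every skip boundary. The class, method and
parameter names are all made up for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; none of these names exist in Lucene. */
final class BlockProxCopySketch {

    /**
     * Copies the prox bytes of one term block by block and returns the
     * output pointer reached after each block of skipInterval docs.
     *
     * @param proxIn             all prox bytes of the input segment
     * @param termProxStart      offset in proxIn where this term starts
     * @param termProxEnd        offset in proxIn where the next term starts
     * @param inSkipProxPointers assumed input offsets at each skip boundary,
     *                           ascending, all within [termProxStart, termProxEnd]
     * @param proxOut            stand-in for the merged prox output
     */
    static List<Long> copyProxByBlocks(byte[] proxIn,
                                       long termProxStart,
                                       long termProxEnd,
                                       long[] inSkipProxPointers,
                                       ByteArrayOutputStream proxOut) {
        List<Long> outSkipProxPointers = new ArrayList<Long>();
        long blockStart = termProxStart;
        for (long boundary : inSkipProxPointers) {
            // Bulk copy the bytes of one block of skipInterval docs.
            proxOut.write(proxIn, (int) blockStart, (int) (boundary - blockStart));
            // The output pointer is known here, so a skip entry could be written.
            outSkipProxPointers.add((long) proxOut.size());
            blockStart = boundary;
        }
        // Copy the tail after the last skip boundary (fewer than skipInterval docs).
        proxOut.write(proxIn, (int) blockStart, (int) (termProxEnd - blockStart));
        return outSkipProxPointers;
    }

    public static void main(String[] args) {
        byte[] proxIn = {9, 9, 1, 2, 3, 4, 5, 6, 7, 9};
        ByteArrayOutputStream proxOut = new ByteArrayOutputStream();
        proxOut.write(42); // pretend an earlier term already wrote one byte
        List<Long> skips = copyProxByBlocks(proxIn, 2, 9, new long[] {5, 7}, proxOut);
        System.out.println(skips + " " + proxOut.size());
    }
}
```

Whether the input skip data can be read cheaply enough in SegmentMerger
to drive a loop like this is the part I am not sure about.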

Regards,
Paul Elschot


On Monday 22 September 2008 19:24:38, Michael McCandless wrote:
> OK, on closer inspection, I don't think this optimization will work,
> unless I'm missing something... But it was a good idea, so keep 'em
> coming!
>
> The TermInfo only stores proxPointer for each term, not per document
> in the postings.  This means the optimization could only apply if
> there are no deleted docs in the posting, and the in & out formats
> are congruent.  Then we would move writing to proxOutput out of the
> while loop in appendPostings to do a bulk copy of all bytes in the
> proxStream for that one term & segment.
>
> But, there's a problem with that: we can't compute the skip pointer
> as we write.  The DefaultSkipListWriter looks at the proxOutput
> pointer every skipInterval docs written and records the offset.  If
> we bulk-copy the prox bytes at the end we have no idea what the
> offset is every skipInterval docs.
>
> Mike
>
> Paul Elschot wrote:
> > On Friday 19 September 2008 17:05:29, Michael McCandless wrote:
> >> Not quite, because how positions are encoded depends on whether
> >> any payload appeared in that segment.
> >>
> >> However, if 1) the input is a SegmentReader (since in general we
> >> can merge any IndexReader), and 2) its format is "congruent" with
> >> the format we are writing (i.e. both do or both don't use the payloads
> >> format), which ought to be true the vast majority of the time,
> >> then I think we could simply copy bytes.  Since the next TermInfo
> >> tells us the proxPointer where it begins, we know exactly how many
> >> bytes to copy. I think this'd be a nice optimization!
> >
> > I tried to find a way to do this, but I'm stuck at the point where
> > the proxPointer is needed from a TermInfo.
> > I got this far (uncompiled code, smi is the SegmentMergeInfo
> > that is currently merged):
> >
> >    if (smi.reader instanceof SegmentReader) {
> >      SegmentReader inputReader = (SegmentReader) smi.reader;
> >      boolean readerStorePayloads =
> >        inputReader.fieldInfos.fieldInfo(smi.term.field).storePayloads;
> >      if (storePayloads == readerStorePayloads) {
> >        // take the difference of the two prox pointers:
> >        int positionsLength = inputReader.tis. ... -  ...;
> >        // do a direct byte copy from inputReader to proxOutput:
> >        ... ;
> >      }
> >    }
> >
> > but I could not find out how to get from the TermInfosReader
> > at inputReader.tis to the next prox pointer.
> >
> > SegmentMerger never needs to index the positions by using a
> > proxPointer itself, as it accesses all positions serially. This
> > leaves me without an example on how to use proxPointer from a
> > TermInfo.
> >
> > Any tips on how to continue?
> >
> > Regards,
> > Paul Elschot
> >
> >> Mike
> >>
> >> Paul Elschot wrote:
> >>> I'm looking at the for loop in SegmentMerger.java at line 666,
> >>> which completely interprets the input positions/payloads for
> >>> an input term at a document.
> >>>
> >>> The positions/payloads don't change when they are merged, is that
> >>> correct? I'm wondering whether this loop could be replaced by a
> >>> direct copy from the input postings to proxOutput.
> >>>
> >>> Regards,
> >>> Paul Elschot
