lucene-dev mailing list archives

From Paul Elschot
Subject Re: Could positions/payloads in SegmentMerger be copied directly?
Date Tue, 23 Sep 2008 15:21:45 GMT
On Tuesday 23 September 2008 10:56:04, Michael McCandless wrote:
> Paul Elschot wrote:
> > I had another look at SegmentTermDocs.skipTo() and at
> > SegmentTermPositions, and I think I'm beginning to get
> > your point.
> >
> > Could it be doable per skipInterval docs?
> Almost ... but not quite, except maybe for the first segment being
> merged.
> The problem is, the new skip data will not in general be "aligned" to
> the old skip data, except for the first segment.
> EG the skipInterval is 16; say for term "foo" the first segment has
> 18 docs and the 2nd segment has 22 docs.  We could bulk-copy that
> first chunk of 16 docs from the first segment, but then we write
> another 2 docs, and 14 docs into the 2nd segment we need to write
> new skip data.  So we cannot bulk copy the 2nd segment, since then
> we won't know the byte offset at that 14 doc point.
> I guess we could entertain allowing skip intervals to not be
> "regular", such that at the boundaries of previously merged segments
> it's allowed to be different, but that's getting more invasive.
> We have recently made great strides having merging be a bulk
> byte-copy operation when possible (eg stored fields & term vectors do
> this now), so I agree it'd be fabulous to get the postings to do bulk
> byte copy.  They are the slowest part of merging now.
> The frq postings could "almost" be made appendable, if we stored the
> last docID in a posting list in the term dictionary.  This way we
> could append, but simply rewrite only the first document of each
> segment after the first segment to be the delta of its docID and the
> last docID in the segment before it.  But again we'd be in trouble
> with the skip data.
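
To make the two points in the quoted message concrete, here is a minimal
sketch in Java. It is an illustration only: the class and method names are
hypothetical and not part of Lucene, postings are modelled as plain int
arrays of doc deltas (no VInt encoding, no frequencies), and deleted
documents are assumed absent. appendDeltas() shows that only the first
delta of an appended segment needs rewriting; mergedSkipPoints() shows
where the skip points of the merged list fall inside each source segment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names exist in Lucene.
class PostingsAppendSketch {

    /**
     * Appends the doc deltas of segment B after those of segment A.
     * lastDocA is A's last docID and docBaseB is B's doc base, both in
     * merged numbering. Only B's first delta is rewritten.
     */
    static int[] appendDeltas(int[] deltasA, int lastDocA, int[] deltasB, int docBaseB) {
        int[] out = Arrays.copyOf(deltasA, deltasA.length + deltasB.length);
        System.arraycopy(deltasB, 0, out, deltasA.length, deltasB.length);
        if (deltasB.length > 0) {
            // Inside B the first delta equals B's first local docID.
            out[deltasA.length] = docBaseB + deltasB[0] - lastDocA;
        }
        return out;
    }

    /**
     * For each skip point of the merged posting list, returns
     * {segment index, number of that segment's docs written so far}.
     */
    static List<int[]> mergedSkipPoints(int skipInterval, int[] docCounts) {
        List<int[]> points = new ArrayList<>();
        int base = 0; // docs of this term written before the current segment
        for (int seg = 0; seg < docCounts.length; seg++) {
            int end = base + docCounts[seg];
            // First multiple of skipInterval strictly greater than base.
            int first = (base / skipInterval + 1) * skipInterval;
            for (int p = first; p <= end; p += skipInterval) {
                points.add(new int[] {seg, p - base});
            }
            base = end;
        }
        return points;
    }

    public static void main(String[] args) {
        // The example from the quoted message: skipInterval 16, 18 and 22 docs.
        for (int[] p : mergedSkipPoints(16, new int[] {18, 22})) {
            boolean aligned = p[1] % 16 == 0;
            System.out.println("segment " + p[0] + ", after " + p[1]
                    + " docs, aligned=" + aligned);
        }
    }
}
```

For the 18/22 example, the merged skip point in the second segment falls
after 14 of its docs, where the old segment has no skip entry and hence no
recorded byte offset.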

So, adding a per-document offset from the documents/frequencies
into the positions/payloads would allow:
- bulk copying of the positions/payloads during merging, and
- a more efficient implementation of TermPositions.skipTo(),
  in that decoding the positions from the last available skip
  document to the target of skipTo() could be avoided.
Is that correct?
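
A rough sketch of the second point, in Java, under simplifying
assumptions: positions are stored per document as variable-length
position deltas, payloads are ignored, and the per-document offsets live
in a separate int array rather than in the documents/frequencies. All
names are hypothetical and not Lucene's. Without offsets, reaching a
target document means stepping over every position of the documents in
between; with an offset per document it is a direct lookup.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch; none of these names exist in Lucene.
class PositionOffsetSketch {

    /** Writes a non-negative int using 7 bits per byte, high bit = "more bytes follow". */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Returns the index just past the variable-length int starting at pos. */
    static int skipVInt(byte[] buf, int pos) {
        while ((buf[pos] & 0x80) != 0) {
            pos++;
        }
        return pos + 1;
    }

    /**
     * Encodes each document's positions as deltas and records in
     * offsetsOut[d] the byte offset where document d's positions start.
     */
    static byte[] encode(int[][] positionsPerDoc, int[] offsetsOut) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int d = 0; d < positionsPerDoc.length; d++) {
            offsetsOut[d] = out.size();
            int last = 0;
            for (int position : positionsPerDoc[d]) {
                writeVInt(out, position - last);
                last = position;
            }
        }
        return out.toByteArray();
    }

    /** Without per-document offsets: step over all positions of docs fromDoc..targetDoc-1. */
    static int seekByDecoding(byte[] prox, int[] freqs, int fromDoc, int fromPos, int targetDoc) {
        int pos = fromPos;
        for (int d = fromDoc; d < targetDoc; d++) {
            for (int i = 0; i < freqs[d]; i++) {
                pos = skipVInt(prox, pos);
            }
        }
        return pos;
    }

    public static void main(String[] args) {
        int[][] positions = {{1, 200, 201}, {5}, {300, 1000}};
        int[] freqs = {3, 1, 2};
        int[] offsets = new int[positions.length];
        byte[] prox = encode(positions, offsets);
        // Both routes reach the same byte; only the second avoids decoding.
        System.out.println(seekByDecoding(prox, freqs, 0, 0, 2) == offsets[2]);
    }
}
```

The same offsets are what a merger would need in order to know byte
positions at arbitrary documents without decoding, which is why the two
benefits go together.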

That would indeed be invasive.

Paul Elschot
