lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Could positions/payloads in SegmentMerger be copied directly?
Date Wed, 24 Sep 2008 14:19:03 GMT

Paul Elschot wrote:

> On Tuesday 23 September 2008 20:26:18, Michael McCandless wrote:
>> Paul Elschot wrote:
>>> So, adding a document offset from the  documents/frequencies
>>> into the positions/payloads for each document would allow:
>>> -  bulk copying of the position/payloads during merging, and
>>> -  a more efficient implementation of TermPositions.skipTo()
>>>  in that decoding the positions from the last available skip
>>>  document to the target of skipTo() could be avoided.
>>> Is that correct?
>>
>> Yes, though this would also add cost of computing/writing/reading
>> that new offset, and would increase the index size.
>>
>>> That would indeed be invasive.
>>
>> Yes.  I think our time would likely be better spent working on using
>> PForDelta for freq/prox.
>
> To change the prox data to PForDelta, it would be nice to have
> bulk copies of prox working first. That would make it easy to
> change the total size of the prox data.
>
> But it appears to be easier to start with the doc/freq data, add
> more prox pointers there, and then change the prox data.
>
> PForDelta is fundamentally different from the existing index data
> because an encoded number cannot be accessed on a byte
> boundary. I don't know yet how to deal with that in the index
> data structures.

PForDelta encodes multiples of 32 ints at a time, so the pointers
stored in the term dict, and in the skip data, would presumably have
to be a block number (or a byte position in the file) plus an offset
within the block.
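
A minimal sketch of such a pointer, packing a block number and an
offset-within-block into one long. The class name, method names, and
the 128-int block size are assumptions chosen for illustration (128 is
just one multiple of 32); none of this is an existing Lucene API.

```java
// Hypothetical helper, not part of Lucene.
final class BlockPointer {
    // Assumed block size: 128 ints, so the offset fits in 7 bits.
    static final int BLOCK_SIZE = 128;
    static final int OFFSET_BITS = 7;

    // Combine block number (high bits) and offset within block (low bits).
    static long pack(long blockNumber, int offsetInBlock) {
        if (blockNumber < 0 || offsetInBlock < 0 || offsetInBlock >= BLOCK_SIZE) {
            throw new IllegalArgumentException("block number or offset out of range");
        }
        return (blockNumber << OFFSET_BITS) | offsetInBlock;
    }

    // Which block must be loaded and decoded.
    static long blockNumber(long pointer) {
        return pointer >>> OFFSET_BITS;
    }

    // Where to start reading inside the decoded block.
    static int offsetInBlock(long pointer) {
        return (int) (pointer & (BLOCK_SIZE - 1));
    }
}
```

A term dict entry or skip entry would then store one such value in
place of today's plain byte position.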

And then an entire block must be fully decoded when loaded (I don't
think it's easy to partially decode with PForDelta, unless the block
luckily had no exceptions?), and then you start from the
offset-within-block you need.
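
To make "decode the whole block, then start at the offset" concrete,
here is a sketch of the exception-free case only: every value in a
32-int block is stored with the same bit width. It is plain
fixed-width bit packing, not full PForDelta (there is no exception
patching), and the class and method names are hypothetical. Values
are assumed non-negative and to fit in `bits` bits (1 to 31).

```java
// Hypothetical sketch: fixed-width bit packing of one 32-int block.
final class PackedBlock {
    static final int BLOCK_SIZE = 32;

    // Pack BLOCK_SIZE values, each using `bits` bits, into 64-bit words.
    static long[] encode(int[] values, int bits) {
        long[] packed = new long[(BLOCK_SIZE * bits + 63) / 64];
        for (int i = 0; i < BLOCK_SIZE; i++) {
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            packed[word] |= ((long) values[i]) << shift;
            if (shift + bits > 64) {
                // Value straddles two words: spill the high bits.
                packed[word + 1] |= ((long) values[i]) >>> (64 - shift);
            }
        }
        return packed;
    }

    // Decode the entire block; the caller then indexes into the result.
    static int[] decodeAll(long[] packed, int bits) {
        int[] out = new int[BLOCK_SIZE];
        long mask = (1L << bits) - 1;
        for (int i = 0; i < BLOCK_SIZE; i++) {
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[word] >>> shift;
            if (shift + bits > 64) {
                v |= packed[word + 1] << (64 - shift);
            }
            out[i] = (int) (v & mask);
        }
        return out;
    }
}
```

A reader would call decodeAll once per loaded block and begin
consuming values at index offset-within-block. In this exception-free
layout a single value could also be pulled out directly by bit
position; it is the exceptions in real PForDelta that make partial
decoding hard.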

I think a single block would, in general, hold more than one term's
postings data.  I.e., these blocks are like "pages" in virtual memory.

Also I wonder how PForDelta would impact performance of queries that
rely heavily on skipping (AND queries), because the entire block must
be decoded to read a few of its ints.

However, with PForDelta I don't think we'd be able to do byte block
copying when merging, unless we were willing to keep the "seams" of
past merges present in the index files (the invasive change I was
referring to), and even then only if there were no deletions to apply.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

