lucene-dev mailing list archives

From: Marvin Humphrey <>
Subject: Re: Flexible indexing design
Date: Thu, 24 Apr 2008 22:54:17 GMT

On Apr 24, 2008, at 4:47 AM, Michael McCandless wrote:
>> Seeking might get a little weird, I suppose.
> Maybe not?: if the container is only aware of the single InStream, and
> say it's "indexed" with a multi-skip index, then when you ask
> container to seek, it forwards the request to multi-skip which jumps
> as close as it can, and then it asks the codec to skip docs until it's
> at the requested docID.  Ie, the container can still be given a single
> InStream, even though the codec "thinks" it's working with 3.

So, if I follow you...

1) When the requested doc number is far enough away to trigger  
skipping, segmentPostingList.skipTo() would operate on the skip stream  
and generate an opaque SkipDatum object (or something like that),  
which it would pass to the codec object.

2) The codec object would use the contents of the SkipDatum object --
an array of file pointers large enough to accommodate all of the  
InStreams, plus payload-type information as applicable -- to update  
itself into a consistent state just shy of or at the target doc.

3) The codec object would iterate through docs until it's at or past
the target.
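
In rough Java, that handoff might look something like the sketch below.
SkipDatum and PostingCodec are names invented here for illustration --
they aren't existing Lucene or KinoSearch classes:

    import;

    // One record handed from the skip stream to the codec.
    class SkipDatum {
      long[] filePointers;  // one pointer per InStream the codec expects
      int docID;            // doc number at the skip point
    }

    abstract class PostingCodec {
      // Step 2: restore internal state just shy of (or at) the skip point.
      abstract void seek(SkipDatum datum) throws IOException;
      // Step 3: advance one doc; false when the postings are exhausted.
      abstract boolean next() throws IOException;
      abstract int doc();

      // Steps 1-3 together, as driven by segmentPostingList.skipTo():
      int skipTo(SkipDatum datum, int target) throws IOException {
        seek(datum);                          // step 2
        while (doc() < target && next()) { }  // step 3
        return doc();
      }
    }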

Here's what's weird.  Say the codec is only operating on 1 stream but  
it thinks it's operating on 3.  Upon receiving the SkipDatum with 3  
file pointers, it will seek the same InStream 3 times.

That will only work if the file pointers are all the same -- which
would be strange, because then the second and third file pointers
wouldn't actually point at real data.

Or perhaps the SkipDatum object will only contain one file pointer per
"real" stream, and the codec object will seek until it runs out of
pointers in the stack?  I.e., the codec has three InStreams, it
receives a SkipDatum with only one file pointer, so it seeks the first
stream, then bails out and returns, ignoring the others.
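
Continuing the sketch above, that bail-out might look like this
(again, illustrative names only):

    // A codec that owns three InStreams but receives a one-pointer
    // SkipDatum seeks the first stream and ignores the rest.
    class SingleStreamCodec {
      interface InStream { void seek(long pointer) throws IOException; }

      private final InStream[] streams;  // may outnumber the pointers

      SingleStreamCodec(InStream[] streams) { this.streams = streams; }

      void seek(SkipDatum datum) throws IOException {
        int n = Math.min(datum.filePointers.length, streams.length);
        for (int i = 0; i < n; i++) {
          streams[i].seek(datum.filePointers[i]);  // stop when pointers run out
        }
      }
    }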

>> I think a variant of that read op could be created against various  
>> versions
>> of the Lucene file format going back, making it possible to isolate  
>> and
>> archive obsolete codecs and clean up the container classes.
> I like this approach.  It would allow us to decouple the codec from
> how many (1, 2, 3) and which files are actually storing the data.

The downside is that the codec object itself suddenly has to get a lot
bigger to hold all the InStreams.

>>> When reading an index, the
>>> Posting/PostingList should be more like TermBuffer than Term.
>> Yes.
>> Now's a good time to remark on a difference between KS and Lucene  
>> when
>> reading the equivalent of TermPositions: In KS, all positions are  
>> read in
>> one grab during ScorePost_Read_Record -- there's no nextPosition()  
>> method.
>> That was a tradeoff I made primarily for simplicity's sake, since  
>> it meant
>> that PhraseScorer could be implemented with integer arrays and  
>> pointer math.
>> Reduced CPU overhead was another theoretical benefit, but I've never
>> benchmarked it.
>> If you didn't want to do that but you still wanted to implement a
>> PhraseScorer based around PostingList objects rather than  
>> TermPositions
>> objects, you have a bit of a quandary: PostingList only advances in
>> doc-sized increments, because it doesn't have a nextPosition()  
>> method.  So,
>> nextPosition() would have to be implemented by ScorePosting:
>>  // Iterate over docs and positions.
>>  while ( {
>>    ScorePosting posting = postingList.getPosting();
>>    while (posting.nextPosition()) {
>>      int position = posting.getPosition();
>>      ...
>>    }
>>  }
>> If we did that, it would require a change in the behavior for
>> ScorePost_Next(), above.
> OK, I guess this is just the follow-through by KS on the same
> philosophy above (precluding differentiating term-only vs
> terms-and-positions queries).

Not exactly.  While the devel branch of KS uses PostingList and the  
unified postings format, the maint branch uses a fairly close variant  
on the Lucene file format and a modified version of the TermDocs  
family.  The equivalent of TermPositions in KS maint *also* reads  
positions in bulk, just like KS devel does.  The only penalty for  
having a TermPositions object read positions in bulk like that is  
memory footprint (since each TermPositions object needs a growable  
array of 32-bit integers to store the bulk positions).
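
For concreteness, here's a minimal sketch of such a bulk read --
assumed code, not KinoSearch's actual implementation -- using a
growable int array and Lucene-style VInts:

    import;
    import;

    class BulkPositions {
      private int[] positions = new int[8];  // grows as needed
      private int count;

      // Read all positions for the current doc; freq = how many.
      void read(DataInput in, int freq) throws IOException {
        if (positions.length < freq) {
          positions = new int[Math.max(freq, positions.length * 2)];
        }
        int pos = 0;
        for (int i = 0; i < freq; i++) {
          pos += readVInt(in);  // positions are stored as deltas
          positions[i] = pos;
        }
        count = freq;
      }

      // Variable-length int, as in the Lucene file format.
      private static int readVInt(DataInput in) throws IOException {
        byte b = in.readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
          b = in.readByte();
          value |= (b & 0x7F) << shift;
        }
        return value;
      }
    }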

Reading positions in bulk or not is a decision that can be made  
independently of either the decision to go with a unified postings  
format or the decision to implement PostingList.  I provided the  
information on KinoSearch's behavior because I thought the code  
samples for PostingList I'd supplied raised the question, "How do you  
iterate through positions if PostingList doesn't know about them?"

> I would think there is a non-trivial
> performance cost for term-only queries in KS?

I'm not sure what the size of the effect is.  KS has its old indexing  
benchmarking app, but it doesn't have a search benchmarking app and I  
don't plan to write one.  I figure it's more profitable to finish the  
C porting of KS, write a Java binding, and try to hook into the Lucene  
contrib benchmarking code than it is to either port the whole thing or  
write one from scratch.

The unified postings format seems likely to suffer some penalty on  
simple term queries and also likely to make up some of that ground on  
phrase queries.  We originally thought motivated users could  
compensate by spec'ing a simpler format for some fields, but now that  
I've actually implemented the unified format, I just don't see that  
approach as practical.

So now, I've swung back to favoring the divide-by-data-type approach,  
but I want to keep the codec/container roles.

Read-time isn't a problem.  We can outfit the codec object with  
multiple InStreams.  PostingList can continue to be a container which  
advances doc-at-a-time (regardless of whether we read positions in  
bulk a la KinoSearch or defer that till later a la Lucene).
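
A rough sketch of that arrangement, with names invented for
illustration:

    import;

    // A divide-by-data-type codec owning one stream per data type;
    // the PostingList container advances strictly doc-at-a-time.
    class PositionsCodec {
      interface InStream { int readVInt() throws IOException; }

      private final InStream docStream;  // doc deltas and freqs
      private final InStream prxStream;  // position deltas
      private int doc, freq;

      PositionsCodec(InStream docs, InStream prx) {
        docStream = docs;
        prxStream = prx;
      }

      // Called once per doc by the container.
      void readDocRecord() throws IOException {
        doc += docStream.readVInt();  // delta-coded doc number
        freq = docStream.readVInt();
        // Positions could be bulk-read from prxStream here (KS style)
        // or deferred until a caller asks for them (Lucene style).
      }
    }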

However, if we add all that stuff, the codec object gets a lot  
bigger.  That means Posting, as I've envisioned and implemented it, is  
no longer suitable.  We'd need both PostingBuffer and Posting.

Marvin Humphrey
Rectangular Research
