lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Flexible indexing design
Date Mon, 28 Apr 2008 15:21:20 GMT

On Apr 27, 2008, at 3:28 AM, Michael McCandless wrote:

> Actually, I was picturing that the container does the seeking itself
> (using skip data), to get "close" to the right point, and then it uses
> the codec to step through single docs at a time until it's at or
> beyond the right one.

I believe it makes sense to have the skip stream be the sole  
responsibility of the container.

In any case, the skip stream would would be the *only* stream the  
container knows about.  All others streams would be internal to the  
codec: frq, prx, etc.

> Ie, no SkipDatum is created & passed to the codec, though likely
> container should notify codec it just got skipped in case it has
> internal state that cares.

But you'll have to seek frq, prx, etc, though -- right?  Say you've  
got a large PostingList currently located near its top and you want  
skip to halfway through.  How do you do that without skipping the  
underlying streams, which the codec knows about but the container  
doesn't?

If you don't skip the internal state of the codec including the  
locations of its InStreams, you just have to keep calling next() on it  
over and over -- an un-optimized skipTo() implementation.  That can't  
be what you want; presumably we're miscommunicating somehow.

> Container is only aware of the single inStream, while codec can still
> think its operating on 3 even if it's really 1 or 2.

I don't understand.  If you have three streams, all of them are going  
to have to get skipped, right?

>> The downside is that the codec object itself suddenly has to get a  
>> lot
>> bigger to hold all the instreams.
>
> But not many instances of the codec are created at once, right?

Yes.

That's not how I originally implemented Posting.  It was a small  
class, and PostingList originally had a Bulk_Read() method that read 1  
kB worth of Posting objects into a ByteBuf, stacked end to end.  But  
we agree that the codec class will need to be larger.

> And even so one could plug in their own (single stream) codec if  
> need be?

Sure.

The question is how we set up PostingList's iterator.  In KS right  
now, SegPList_Next() calls Posting's Read_Record() method, which takes  
an InStream as an argument -- Posting doesn't maintain any streams  
internally.

As soon as the codec starts reading from an indeterminate number of  
streams, though, having the container pass them in for each read op  
won't work any more.  The codec will have to be responsible for all  
data streams.

>> The only penalty for having a TermPositions object read
>> positions in bulk like that is memory footprint (since each  
>> TermPositions
>> object needs a growable array of 32-bit integers to store the bulk
>> positions).
>
> This is a tiny memory cost right?  A TermPositions instance gets
> re-used each time you next() to the next term.

With large documents, the buffer size requirements can get pretty big  
-- consider how often the term "the" might appear in a 1000-page  
novel.  Lucene and its relatives don't work very well with novel-sized  
documents anyway, though, so for better or worse, I blew it off.

> I think you also pay an added up-front (small) decoding cost in cases
> where the consumer will only look at a subset of the positions before
> next()'ing.  Eg a phrase search involving a rare term and a frequent
> term.

Yes, you're right about that.  The KS phrase scorer has less function  
call overhead, though -- it's a more limited design, with no 'slop'  
support, and it operates using pointer math on the arrays of positions  
directly rather than having to make a lot of accessor calls.

My guess is that it's a wash, algorithm-wise.  It seems likely that  
file format would have a significant effect, but I would be surprised  
to see phrase scorer performance in KS and Lucene diverge all that  
much as a result of the algorithmic implementation.

That's why I asserted that the main motivation for bulk-reading  
positions in KS was simplicity, rather than performance or something  
else.

> But the good news is if this framework is plugglable, one can insert
> their own codec to not do the up-front decoding of all positions per
> term X doc.

Yes, that would work.  You would need different Scorer implementations  
to deal with codecs which read the same file format yet are  
implemented differently... but that's a feature. :)

>> We'd need both PostingBuffer and Posting subclasses.
>
> Yes.

OK, I think we're almost at the point where we can hack up prototypes  
for Lucene implementations of PostingBuffer and PostingList that read  
the current file format.  I think seeing those would clarify things.

Ironically, I'm not sure exactly how Posting fits into the equation at  
read-time anymore, but I think that will work itself out as we get  
back to indexing.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message