lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless" <luc...@mikemccandless.com>
Subject Re: Flexible indexing design
Date Sun, 27 Apr 2008 10:28:11 GMT
Marvin Humphrey <marvin@rectangular.com> wrote:

> > > Seeking might get a little weird, I suppose.
> >
> > Maybe not?: if the container is only aware of the single InStream, and
> > say it's "indexed" with a multi-skip index, then when you ask
> > container to seek, it forwards the request to multi-skip which jumps
> > as close as it can, and then it asks the codec to skip docs until it's
> > at the requested docID.  Ie, the container can still be given a single
> > InStream, even though the codec "thinks" it's working with 3.
>
>  So, if I follow you...
>
> 1) When the requested doc number is far enough away to trigger skipping,
> segmentPostingList.skipTo() would operate on the skip stream and generate an
> opaque SkipDatum object (or something like that), which it would pass to the
> codec object.
>
> 2)The codec object would use the contents of the SkipDatum object -- an
> array of file pointers large enough to accommodate all of the InStreams,
> plus payload-type information as applicable -- to update itself into a
> consistent state just shy of or at the target doc.
>
> 3) The codec object will iterate through docs until it's at or past the
> target.
>
> Here's what's weird.  Say the codec is only operating on 1 stream but it
> thinks it's operating on 3.  Upon receiving the SkipDatum with 3 file
> pointers, it will seek the same InStream 3 times.
>
> That will only work if the file pointers are all the same -- which would be
> strange because then the second and third file pointers won't actually be
> marking the real data.
>
> Or perhaps the SkipDatum object will only contain one file pointer per
> "real" stream and the codec object will seek until it runs out of pointers
> in the stack?  i.e. It has three instreams, it receives a SkipDatum with
> only one file pointer, it seeks the first stream, then bails out and
> returns, ignoring the others.

Actually, I was picturing that the container does the seeking itself
(using skip data), to get "close" to the right point, and then it uses
the codec to step through single docs at a time until it's at or
beyond the right one.

Ie, no SkipDatum is created & passed to the codec, though likely
container should notify codec it just got skipped in case it has
internal state that cares.

Container is only aware of the single inStream, while codec can still
think its operating on 3 even if it's really 1 or 2.

> > > I think a variant of that read op could be created against various versions
> > > of the Lucene file format going back, making it possible to isolate and
> > > archive obsolete codecs and clean up the container classes.
> >
> > I like this approach.  It would allow us to decouple the codec from
> > how many (1, 2, 3) and which files are actually storing the data.
>
> The downside is that the codec object itself suddenly has to get a lot
> bigger to hold all the instreams.

But not many instances of the codec are created at once, right?  And
even so one could plug in their own (single stream) codec if need be?

> > > > When reading an index, the
> > > > Posting/PostingList should be more like TermBuffer than Term.
> > > >
> > > >
> > >
> > > Yes.
> > >
> > > Now's a good time to remark on a difference between KS and Lucene when
> > > reading the equivalent of TermPositions: In KS, all positions are read in
> > > one grab during ScorePost_Read_Record -- there's no nextPosition() method.
> > > That was a tradeoff I made primarily for simplicity's sake, since it meant
> > > that PhraseScorer could be implemented with integer arrays and pointer math.
> > > Reduced CPU overhead was another theoretical benefit, but I've never
> > > benchmarked it.
> > >
> > > If you didn't want to do that but you still wanted to implement a
> > > PhraseScorer based around PostingList objects rather than TermPositions
> > > objects, you have a bit of a quandary: PostingList only advances in
> > > doc-sized increments, because it doesn't have a nextPosition() method.  So,
> > > nextPosition() would have to be implemented by ScorePosting:
> > >
> > >  // Iterate over docs and positions.
> > >  while (postingList.next()) {
> > >   ScorePosting posting = postingList.getPosting();
> > >   while (posting.nextPostition()) {
> > >     int position = posting.GetPosition();
> > >     ...
> > >   }
> > >  }
> > >
> > > If we did that, it would require a change in the behavior for
> > > ScorePost_Next(), above.
> >
> > OK, I guess this is just the follow-through by KS on the same
> > philosophy above (precluding differentiating term-only vs
> > terms-and-positions queries).
>
> Not exactly.  While the devel branch of KS uses PostingList and the unified
> postings format, the maint branch uses a fairly close variant on the Lucene
> file format and a modified version of the TermDocs family.  The equivalent
> of TermPositions in KS maint *also* reads positions in bulk, just like KS
> devel does.  The only penalty for having a TermPositions object read
> positions in bulk like that is memory footprint (since each TermPositions
> object needs a growable array of 32-bit integers to store the bulk
> positions).

This is a tiny memory cost right?  A TermPositions instance gets
re-used each time you next() to the next term.

I think you also pay an added up-front (small) decoding cost in cases
where the consumer will only look at a subset of the positions before
next()'ing.  Eg a phrase search involving a rare term and a frequent
term.

But the good news is if this framework is plugglable, one can insert
their own codec to not do the up-front decoding of all positions per
term X doc.

> Reading positions in bulk or not is a decision that can be made
> independently of either the decision to go with a unified postings format or
> the decision to implement PostingList.  I provided the information on
> KinoSearch's behavior because I thought the code samples for PostingList I'd
> supplied begged the question, "How do you iterate through positions if
> PostingList doesn't know about them?"

OK.

> > I would think there is a non-trivial
> > performance cost for term-only queries in KS?
> >
>
> I'm not sure what the size of the effect is.  KS has its old indexing
> benchmarking app, but it doesn't have a search benchmarking app and I don't
> plan to write one.  I figure it's more profitable to finish the C porting of
> KS, write a Java binding and try to hook into the Lucene contrib
> benchmarking code than it is to either port the whole thing or write one
> from scratch.

Got it.

> The unified postings format seems likely to suffer some penalty on simple
> term queries and also likely to make up some of that ground on phrase
> queries.  We originally thought motivated users could compensate by spec'ing
> a simpler format for some fields, but now that I've actually implemented the
> unified format, I just don't see that approach as practical.
>
> So now, I've swung back to favoring the divide-by-data-type approach, but I
> want to keep the codec/container roles.
>
> Read-time isn't a problem.  We can outfit the codec object with multiple
> InStreams.  PostingList can continue to be a container which advances
> doc-at-a-time (regardless of whether we read positions in bulk a la
> KinoSearch or defer that till later a la Lucene).
>
> However, if we add all that stuff, the codec object gets a lot bigger.
> That means Posting, as I've envisioned and implemented it, is no longer
> suitable.  We'd need both PostingBuffer and Posting subclasses.

Yes.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message