From: Nathan Kurz <nate@verse.com>
To: lucy-dev@lucene.apache.org
Subject: Re: Posting codecs
Date: Fri, 26 Sep 2008 01:30:15 -0600

I started looking at the code a bit, trying to figure out a more incremental way of approaching this. I now think there might be room for one, although as always my urge to boil oceans is hard to suppress.

The first step would be to switch InStream to using mmap() instead of its user-level buffering. This looks surprisingly easy, presuming we are allowed to ignore Windows compatibility, or at least Windows performance.

The second step would then be to expose InStream->buf, probably via an accessor so that EOF can be reported. I said I wasn't intending to expose the raw buffer, but once we are using mmap we should never need to call refill(). There would need to be some implied rules, such as never reading beyond the current Posting.

The third step is to change ScorePost->read_record to decode the VBytes directly from the raw buffer, or perhaps better, to create a FastScorePost subclass that does this. InStream->seek() would be called after the buffer is read to update InStream->buf and to assert() that no reads went too far.
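
Here is a rough sketch of what I'm picturing for those three steps, just to make the shape concrete. To be clear, none of this is real KinoSearch code: the file name and record layout are invented, I'm decoding two bare VBytes per record rather than the actual on-disk format, and I'm ignoring error handling and Windows entirely.

    /* Sketch only: mmap() a postings file and decode VByte-compressed
     * integers straight out of the mapped buffer, with no refill() and
     * no per-byte function calls.  POSIX only. */
    #include <assert.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Decode one VByte-compressed integer, advancing *buf. */
    static uint32_t
    decode_vbyte(const uint8_t **buf)
    {
        uint32_t value = 0;
        uint32_t shift = 0;
        uint8_t  byte;
        do {
            byte   = *(*buf)++;
            value |= (uint32_t)(byte & 0x7F) << shift;
            shift += 7;
        } while (byte & 0x80);
        return value;
    }

    int
    main(void)
    {
        const char *path = "seg_1/postings.dat";    /* made-up name */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        /* Read-only shared mapping: no user-level buffer, no copying,
         * and the pages are shared with any other process that maps
         * the same committed (hence immutable) index file. */
        const uint8_t *base =
            mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        const uint8_t *buf   = base;
        const uint8_t *limit = base + st.st_size;
        uint32_t doc_num = 0;

        /* The tight loop that read_record (or FastScorePost) would run:
         * decode a whole record before touching the stream object. */
        while (buf < limit) {
            uint32_t doc_delta = decode_vbyte(&buf);
            uint32_t freq      = decode_vbyte(&buf);
            doc_num += doc_delta;
            assert(buf <= limit);   /* stand-in for the seek()-time check */
            printf("doc %u freq %u\n", doc_num, freq);
        }

        munmap((void *)base, (size_t)st.st_size);
        close(fd);
        return 0;
    }

In the real thing the decode loop would live behind read_record (or the FastScorePost subclass) and the InStream accessor rather than in main(), but the inner loop is the point. The PROT_READ / MAP_SHARED mapping is also what buys the memory-footprint win mentioned further down: every process reading the same committed file shares the same pages.
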
The nice part about this approach is that we should get a substantial speed-up for the common use cases while requiring very few API changes. Old scorers et al. can continue as they did before, and new ones can change over gradually. The downside is that each Scorer remains tied to a particular index format. Long-term I still think this is disastrous, but in the short term it's not that bad. Perhaps we can delay solving this problem until you get P4Delta ready to go. Sound plausible?

Nathan Kurz
nate@verse.com

On Mon, Sep 22, 2008 at 12:35 AM, Nathan Kurz wrote:
> On Fri, Sep 19, 2008 at 8:51 PM, Marvin Humphrey wrote:
>> Let's set a goal of implementing PForDelta, and figure out how to get there from here.
>
> Sure, that seems like a fine goal. I'm not sure if you meant it this way, but I think it would be great to add PForDelta support alongside the existing VByte support rather than simply replacing it. While PForDelta might fully replace VByte eventually, it would be good to design an architecture that allows for multiple file formats, so that tricks like using Lucene index files directly are theoretically possible, and so that the path for creating custom file formats is better tested.
>
>>> You need to figure out some way to decode the entire posting with fewer function calls, and this is where the bulk of them are coming from. I'd suggest having the Store level return an undecoded raw posting, and let the Posting class decode the parts it wants. That way the VByte code can be macros that work on a local buffer in a tight loop.
>>
>> It's a little tricky to give anything other than the InStream object itself direct access to the raw buffer. It's inevitable that the bytes that form a compressed integer will straddle a buffer boundary from time to time.
>
> OK. I wasn't hoping to give access to the raw buffer, but to a cooked one guaranteed to contain the entire posting. But I'd forgotten that the size of the compressed posting isn't known in advance! The important part, though, is that the entire posting be decompressed in a tight loop with minimal function call overhead. So instead of my first suggestion, can we have the Store return a fully decoded Posting?
>
> More fully fleshed out (if still only partially informed), I'd like a system that looks something like this:
>
> 1) Scorer->init creates a template Posting which is passed to the Store for approval. The Store initializes itself and answers affirmatively if it can handle the particular flavor of Posting proposed. For example, Store->init_posting(PositionPosting) might be refused by a Store that only handles MatchPostings due to index limitations. Alternatively, a Store managing an index that has position information available might re-wire itself to use a faster method to fill a MatchPosting than a full PositionPosting.
>
> 2) The scoring loop starts and the Scorer calls Store->fill_posting(PositionPosting, min_doc_number). This is analogous to Scorer->SkipTo(), but decodes the entire posting from the index in a single pass. The Posting has nominal control over its own fields (it resizes its own position buffer, for example), but the Store is allowed to write data directly to the struct and the Scorer reads fields directly from it.
>
> My goals here are to have a system that dramatically speeds up the reads from the index, makes it possible for existing Scorers to work with new index formats, and allows for easy creation of custom Scorers for use with custom indexes.
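
To restate that proposal as a skeletal interface -- nothing below exists in KinoSearch, the struct layout and function names are just placeholders, and I've collapsed the MatchPosting/PositionPosting distinction into a single struct for brevity:

    /* Hypothetical interface sketch -- not actual KinoSearch API. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct Posting {
        uint32_t  doc_num;
        uint32_t  freq;
        uint32_t *positions;       /* owned and resized by the Posting */
        uint32_t  num_positions;
    } Posting;

    /* Opaque: only the Store knows whether the index uses VByte,
     * PForDelta, or something else entirely. */
    typedef struct Store Store;

    /* 1) Negotiation at Scorer->init time.  The Scorer proposes a
     *    template Posting; the Store answers whether it can fill that
     *    flavor (refusing, say, a position-bearing Posting when the
     *    index only stores match data), and may rewire itself to a
     *    faster fill routine based on what was asked for. */
    bool Store_init_posting(Store *store, Posting *template_posting);

    /* 2) The scoring loop.  Analogous to Scorer->SkipTo(), but decodes
     *    the entire posting at or after min_doc_num in one pass.  The
     *    Store writes directly into the struct; the Scorer reads the
     *    fields directly.  Returns false at the end of the postings. */
    bool Store_fill_posting(Store *store, Posting *posting,
                            uint32_t min_doc_num);

The Scorer's inner loop then becomes little more than "while (Store_fill_posting(store, &posting, target)) { ...score... }", with no knowledge of the file format leaking above the Store.
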
> My guess is that a move to this system would give an immediate 2x speedup, while making it easier to add custom features and future optimizations. I think the main downside is that it violates most people's conception of a standard object-oriented model. Personally, my main doubts are about the initialization/negotiation phase --- perhaps there is some better way to do this?
>
> And I apologize for my ad hoc terminology. It's partly that I'm not familiar with the proper KinoSearch terminology, and partly that the levels that are there don't quite match up with my mental model. What I want is for the Posting to be an intermediary between the Scorer and the Store, where the Posting has a known data layout and only the Store knows about the index internals, be it VByte or PForDelta. I'd be happy to call the Store an Index, apart from the confusion between the file and the class, but I'm not comfortable viewing it as a Stream. Is there a more standard way to express this?
>
>>> The last thing (sort of a non-thing) is that, to my surprise, the double/triple buffering of the Posting data doesn't seem to have much of a negative effect. I still think it's worth trying to avoid this, but it's barely on the current radar.
>>
>> The double/triple buffering may not cause a slowdown, but it's worth going with mmap for the backing buffer on InStream simply for memory footprint. Since all index files are read-only once committed, they can be shared. That means if you have multiple processes with multiple IndexReaders all reading the same file, they can all effectively use the same backing buffer, and we'll let the OS manage page swapping.
>
> I agree that there are multiple other reasons for using mmap here. I definitely want a design that makes it easy to drop in mmap versions of the index readers for all the reasons that you specify. I'm also hoping that we can make the rest of the search process efficient enough that the savings from moving to single buffering become worthwhile. But we have a ways to go until that is the case: maybe an 8x speedup from the recent baseline before it starts to be a limiting factor? I'm optimistic that this level of speedup is possible, though.
>
> Nathan Kurz
> nate@verse.com