lucene-dev mailing list archives

From "Michael McCandless" <luc...@mikemccandless.com>
Subject Re: Pooling of posting objects in DocumentsWriter
Date Thu, 10 Apr 2008 09:37:29 GMT
Marvin Humphrey <marvin@rectangular.com> wrote:
>
>  On Apr 8, 2008, at 10:25 AM, Michael McCandless wrote:
>
> > I've actually been working on factoring DocumentsWriter, as a first
> > step towards flexible indexing.
> >
>
>  The way I handled this in KS was to turn Posting into a class akin to
> TermBuffer: the individual Posting object persists, but its values change.
>
>  Meanwhile, each Posting subclass has a Read_Raw method which generates a
> "RawPosting".  RawPosting objects are a serialized, sortable, lowest common
> denominator form of Posting which every subclass must be able to export.
> They're allocated from a specialized MemoryPool, making them cheap to
> manufacture and to release.
>
>  RawPosting is the only form PostingsWriter is actually required to know
> about:
>
>    // PostingsWriter loop: drain the queue until it is empty.
>    RawPosting rawPosting;
>    while ((rawPosting = rawPostingQueue.pop()) != null) {
>       writeRawPosting(rawPosting);
>    }
>
> > I agree we would have an abstract base Posting class that just tracks
> > the term text.
> >
>
>  IMO, the abstract base Posting class should not track text.  It should
> include only one datum: a document number.  This keeps it in line with the
> simplest IR definition for a "posting": one document matching one term.

But how do you then write out a segment with the terms packed, in
sorted order?  Your "generic" layer needs to know how to sort these
Posting lists by term text, right?
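
To make the question concrete, here is a minimal sketch of the ordering a term-packed segment needs: term text first, then document number within each term. The class `RawPostingSketch` and its fields are hypothetical and are not the KS RawPosting or any Lucene class. The sketch assumes the serialized form carries the term text alongside the doc number, which the thread does not confirm.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical record for illustration: one (term, document) pair.
final class RawPostingSketch {
    final String termText;
    final int docNum;

    RawPostingSketch(String termText, int docNum) {
        this.termText = termText;
        this.docNum = docNum;
    }

    // Term text first (String order, i.e. UTF-16 code units), then doc number.
    static final Comparator<RawPostingSketch> BY_TERM_THEN_DOC =
        Comparator.comparing((RawPostingSketch p) -> p.termText)
                  .thenComparingInt(p -> p.docNum);

    // Returns a sorted copy and leaves the input list untouched.
    static List<RawPostingSketch> sorted(List<RawPostingSketch> in) {
        List<RawPostingSketch> out = new ArrayList<>(in);
        out.sort(BY_TERM_THEN_DOC);
        return out;
    }
}
```

If the sortable form did not carry the term text (or a term ID that sorts the same way), the generic layer would have no key to sort on, which is the concern raised above.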

>    Posting:        doc num (abstract)
>    MatchPosting:   doc num
>    ScorePosting:   doc num, freq, per-doc boost, positions
>    RichPosting:    doc num, freq, positions with per-position boost
>    PayloadPosting: doc num, payload

OK I now see that what we call Posting really should be called
PostingList: each instance of this class, in DW, tracks all documents
that contained that term.  Whereas for KS, Posting is a single
occurrence of a term in a single doc, right?  Does a Posting contain all
occurrences of the term in the doc (multiple positions) or only one?
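
The two meanings of "Posting" can be sketched side by side. All names below are hypothetical. The sketch assumes the per-document form holds every position of the term in that document, which is what the "freq, positions" fields in the list above suggest but which the thread does not settle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical: one term in one document (the KS-style sense of "posting").
final class DocPostingSketch {
    final int docNum;
    final int[] positions; // assumed: all occurrences within this document

    DocPostingSketch(int docNum, int... positions) {
        this.docNum = docNum;
        this.positions = positions;
    }

    int freq() {
        return positions.length;
    }
}

// Hypothetical: per-term accumulator (what DW currently calls Posting).
final class TermPostingsSketch {
    final String termText;
    final List<DocPostingSketch> docs = new ArrayList<>();

    TermPostingsSketch(String termText) {
        this.termText = termText;
    }
}

// Groups per-document postings under their term; TreeMap keeps terms sorted.
final class PostingBufferSketch {
    private final Map<String, TermPostingsSketch> byTerm = new TreeMap<>();

    void add(String termText, DocPostingSketch posting) {
        byTerm.computeIfAbsent(termText, TermPostingsSketch::new).docs.add(posting);
    }

    TermPostingsSketch get(String termText) {
        return byTerm.get(termText);
    }
}
```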

How do you do buffering/flushing?  After each document do you re-sweep
your Posting instances and write them into a single segment?  Or do
you accumulate many of these Posting instances (so many docs are held in
this form) and when RAM is full you flush to disk?

>  Then, for search-time you have a PostingList class which takes the place of
> TermDocs/TermPositions, and uses an underlying Posting object to read the
> file.  (PostingList and its subclasses don't know anything about file
> formats.)

Wouldn't PostingList need to know something of the file format?  EG
maybe it's a sparse format (docID or gap encoded each time), or, it's
non-sparse (like norms, column-stride fields).
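
A small sketch of the difference, with hypothetical names: in a sparse layout the reader has to reconstruct each doc ID from stored gaps, while in a non-sparse layout the doc ID is simply the array index and is never stored. Whoever iterates the data needs to know which of the two it is reading.

```java
final class DocIdEncodingSketch {
    // Sparse: store each doc ID as the gap from the previous one.
    static int[] toGaps(int[] sortedDocIds) {
        int[] gaps = new int[sortedDocIds.length];
        int prev = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            gaps[i] = sortedDocIds[i] - prev;
            prev = sortedDocIds[i];
        }
        return gaps;
    }

    // Sparse read path: a running sum turns gaps back into doc IDs.
    static int[] fromGaps(int[] gaps) {
        int[] docIds = new int[gaps.length];
        int docId = 0;
        for (int i = 0; i < gaps.length; i++) {
            docId += gaps[i];
            docIds[i] = docId;
        }
        return docIds;
    }

    // Non-sparse: one slot per document, so lookup is a plain index.
    static byte denseLookup(byte[] perDocValues, int docId) {
        return perDocValues[docId];
    }
}
```

For example, doc IDs 3, 7, 8, 20 are stored as gaps 3, 4, 1, 12.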

>  Each Posting subclass is associated with a subclass of TermScorer which
> implements its own Posting-subclass-specific scoring algorithm.
>
>    // MatchPostingScorer scoring algo: every match scores a constant 1.0
>    while (postingList.next()) {
>       MatchPosting posting = (MatchPosting)postingList.getPosting();
>       collector.collect(posting.getDocNum(), 1.0f);
>    }
>
>    // ScorePostingScorer scoring algo...
>    while (postingList.next()) {
>       ScorePosting posting = (ScorePosting)postingList.getPosting();
>       int freq = posting.getFreq();
>       float score = freq < TERMSCORER_SCORE_CACHE_SIZE
>                   ? scoreCache[freq]            // cache hit
>                   : sim.tf(freq) * weightValue;
>       collector.collect(posting.getDocNum(), score);
>    }
>
>
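
The score cache in the second loop can be sketched on its own. This is an illustration only: the class name, the cache size of 32, and the square-root `tf` are assumptions standing in for `TERMSCORER_SCORE_CACHE_SIZE` and `sim.tf()`, whose actual definitions are not shown in this thread.

```java
final class ScoreCacheSketch {
    static final int CACHE_SIZE = 32; // assumed value, for illustration

    private final float weightValue;
    private final float[] scoreCache = new float[CACHE_SIZE];

    ScoreCacheSketch(float weightValue) {
        this.weightValue = weightValue;
        // Precompute tf(freq) * weight for the small, common frequencies.
        for (int freq = 0; freq < CACHE_SIZE; freq++) {
            scoreCache[freq] = tf(freq) * weightValue;
        }
    }

    // Assumed stand-in for sim.tf(): square root of the term frequency.
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    float score(int freq) {
        return freq < CACHE_SIZE
            ? scoreCache[freq]         // cache hit
            : tf(freq) * weightValue;  // computed on demand
    }
}
```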
> > And then the code that writes the current index format would plug into
> > this and should be fairly small and easy to understand.
> >
>
>  I'm pessimistic that anything that writes the current index format
> could be "easy to understand", because the spec is so dreadfully convoluted.

I'm quite a bit more optimistic here.

>  As I have argued before, the key is to have each Posting subclass wholly
> define a file format.  That makes them pluggable, breaking the tight binding
> between the Lucene codebase and the Lucene file format spec.

It's not just Posting that defines the file format.  Things like
stored fields, norms, column-stride fields, have nothing to do with
inversion.  So these writers/readers should "plug in" at a layer above
the inversion?  OK, I see these below:

> > Then there would also be plugins that just tap into the entire
> > document (don't need inversion), like FieldsWriter.
> >
>
>
>  Yes.  Here's how things are set up in KS:
>
>    InvIndexer
>       SegWriter
>          DocWriter
>          PostingsWriter
>             LexWriter
>          TermVectorsWriter
>          // plug in more writers here?
>
>  Ideally, all of the writers under SegWriter would be subclasses of an
> abstract SegDataWriter class, and would implement addInversion() and
> addSegment(). SegWriter.addDoc() would look something like this:
>
>    void addDoc(Document doc) {
>       // Invert once, then hand the same inversion to every writer.
>       Inversion inversion = invert(doc);
>       for (int i = 0; i < writers.length; i++) {
>          writers[i].addInversion(inversion);
>       }
>    }

I think TermVectorsWriter should be seen as a consumer of the
"inversion" plugin API.  It's just that, unlike the frq/prx writer,
which flushes when RAM is full, the TermVectorsWriter flushes after
each doc.  Ie, the generic code does the inversion, feeding "you"
Posting occurrences, and "you" write this to a file however you want.
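
A rough sketch of that idea, with hypothetical names throughout: the generic code feeds every consumer the same occurrences, and each consumer decides for itself when to flush. The occurrence-count budget below is only a stand-in for "RAM is full".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plug-in interface; not an existing Lucene or KS API.
interface InversionConsumerSketch {
    void addOccurrence(int docNum, String termText, int position);
    void finishDocument(int docNum);
}

// Flushes after every document, as a term-vectors style writer would.
final class PerDocConsumerSketch implements InversionConsumerSketch {
    final List<String> flushed = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();

    public void addOccurrence(int docNum, String termText, int position) {
        current.append(termText).append('@').append(position).append(' ');
    }

    public void finishDocument(int docNum) {
        flushed.add(docNum + ": " + current.toString().trim());
        current.setLength(0);
    }
}

// Buffers across documents and flushes only once a budget is reached.
final class BufferingConsumerSketch implements InversionConsumerSketch {
    private final int maxBuffered;
    private int buffered = 0;
    int flushCount = 0;

    BufferingConsumerSketch(int maxBuffered) {
        this.maxBuffered = maxBuffered;
    }

    public void addOccurrence(int docNum, String termText, int position) {
        buffered++;
    }

    public void finishDocument(int docNum) {
        if (buffered >= maxBuffered) { // stand-in for a RAM check
            flushCount++;
            buffered = 0;
        }
    }
}
```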

>  In practice, three of the writers are required (one for term
> dictionary/lexicon, one for postings, and one for some form of document
> storage), but the design allows for plugging in additional SegDataWriter
> subclasses.

OK.

Mike


