lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Pooling of posting objects in DocumentsWriter
Date Tue, 08 Apr 2008 23:31:04 GMT

On Apr 8, 2008, at 10:25 AM, Michael McCandless wrote:
> I've actually been working on factoring DocumentsWriter, as a first
> step towards flexible indexing.

The way I handled this in KS was to turn Posting into a class akin to  
TermBuffer: the individual Posting object persists, but its values  
change.

Meanwhile, each Posting subclass has a Read_Raw method which generates  
a "RawPosting".  RawPosting objects are a serialized, sortable, lowest  
common denominator form of Posting which every subclass must be able  
to export. They're allocated from a specialized MemoryPool, making  
them cheap to manufacture and to release.
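
To make the shape of that concrete, here is a minimal sketch in Java.
Everything in it is an assumption made for illustration rather than
the KS API: the field names, the readRaw() signature, the choice of
term text plus doc num as the sort key, and the use of plain heap
allocation where KS would draw from the MemoryPool.

```java
// Hypothetical sketch; class layout and field names are assumptions,
// not the KS API.
import java.nio.ByteBuffer;

/** Serialized, sortable form: ordered by term text, then by doc num. */
final class RawPosting implements Comparable<RawPosting> {
    final byte[] termText;
    final int docNum;
    final byte[] blob; // subclass-specific data, opaque to the writer

    RawPosting(byte[] termText, int docNum, byte[] blob) {
        this.termText = termText;
        this.docNum = docNum;
        this.blob = blob;
    }

    @Override
    public int compareTo(RawPosting other) {
        // Compare term text as unsigned bytes, shorter prefix first.
        int shared = Math.min(termText.length, other.termText.length);
        for (int i = 0; i < shared; i++) {
            int cmp = Integer.compare(termText[i] & 0xFF,
                                      other.termText[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        if (termText.length != other.termText.length) {
            return Integer.compare(termText.length, other.termText.length);
        }
        return Integer.compare(docNum, other.docNum);
    }
}

/** One long-lived object whose values are overwritten per posting. */
abstract class Posting {
    int docNum;

    /** Export the current values in the lowest common denominator form. */
    abstract RawPosting readRaw(byte[] termText);
}

final class ScorePosting extends Posting {
    int freq;

    @Override
    RawPosting readRaw(byte[] termText) {
        // Heap allocation stands in for the specialized MemoryPool.
        byte[] blob = ByteBuffer.allocate(Integer.BYTES).putInt(freq).array();
        return new RawPosting(termText.clone(), docNum, blob);
    }
}
```

The same ScorePosting instance can be refilled and exported again;
only the RawPosting objects accumulate, and those are what get sorted.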

RawPosting is the only form PostingsWriter is actually required to  
know about:

    // PostingsWriter loop:
    RawPosting rawPosting;
    while ((rawPosting = rawPostingQueue.pop()) != null) {
       writeRawPosting(rawPosting);
    }

> I agree we would have an abstract base Posting class that just tracks
> the term text.

IMO, the abstract base Posting class should not track text.  It should  
include only one datum: a document number.  This keeps it in line with  
the simplest IR definition for a "posting": one document matching one  
term.

    Posting:        doc num (abstract)
    MatchPosting:   doc num
    ScorePosting:   doc num, freq, per-doc boost, positions
    RichPosting:    doc num, freq, positions with per-position boost
    PayloadPosting: doc num, payload

Then, for search-time you have a PostingList class which takes the  
place of TermDocs/TermPositions, and uses an underlying Posting object  
to read the file.  (PostingList and its subclasses don't know anything  
about file formats.)

Each Posting subclass is associated with a subclass of TermScorer  
which implements its own Posting-subclass-specific scoring algorithm.

    // MatchPostingScorer scoring algo ...
    while (postingList.next()) {
       MatchPosting posting = (MatchPosting)postingList.getPosting();
       collector.collect(posting.getDocNum(), 1.0f);
    }

    // ScorePostingScorer scoring algo...
    while (postingList.next()) {
       ScorePosting posting = (ScorePosting)postingList.getPosting();
       int freq = posting.getFreq();
       float score = freq < TERMSCORER_SCORE_CACHE_SIZE
                   ? scoreCache[freq]            // cache hit
                   : sim.tf(freq) * weightValue;
       collector.collect(posting.getDocNum(), score);
    }
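
The score cache in that second loop can be sketched on its own as
follows.  The class name, the cache size of 32, and the square-root
tf() standing in for sim.tf() are all assumptions made to keep the
example self-contained.

```java
// Hypothetical sketch; the class name and cache size are assumptions.
final class ScoreCache {
    static final int SCORE_CACHE_SIZE = 32;

    private final float[] cache = new float[SCORE_CACHE_SIZE];
    private final float weightValue;

    ScoreCache(float weightValue) {
        this.weightValue = weightValue;
        // Precompute scores for the small freqs that dominate in practice.
        for (int freq = 0; freq < SCORE_CACHE_SIZE; freq++) {
            cache[freq] = tf(freq) * weightValue;
        }
    }

    // Stand-in for sim.tf(); square root is an assumption for this sketch.
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    float score(int freq) {
        return freq < SCORE_CACHE_SIZE
             ? cache[freq]               // cache hit
             : tf(freq) * weightValue;   // computed on demand
    }
}
```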

> And then the code that writes the current index format would plug into
> this and should be fairly small and easy to understand.

I'm pessimistic that anything that writes the current index  
format could be "easy to understand", because the spec is so  
dreadfully convoluted.

As I have argued before, the key is to have each Posting subclass  
wholly define a file format.  That makes them pluggable, breaking the  
tight binding between the Lucene codebase and the Lucene file format  
spec.
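
A rough sketch of what "wholly define a file format" could mean in
Java: each subclass owns both directions of its encoding, and the
calling code touches only the base class.  The write()/read() method
names, the fixed-width ints, and the FormatDemo helper are assumptions
for illustration, not KS or Lucene code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch; method names and encodings are assumptions.
abstract class Posting {
    int docNum;

    abstract void write(DataOutput out) throws IOException;
    abstract void read(DataInput in) throws IOException;
}

final class MatchPosting extends Posting {
    @Override
    void write(DataOutput out) throws IOException {
        out.writeInt(docNum);
    }

    @Override
    void read(DataInput in) throws IOException {
        docNum = in.readInt();
    }
}

final class ScorePosting extends Posting {
    int freq;

    @Override
    void write(DataOutput out) throws IOException {
        out.writeInt(docNum);
        out.writeInt(freq);
    }

    @Override
    void read(DataInput in) throws IOException {
        docNum = in.readInt();
        freq = in.readInt();
    }
}

final class FormatDemo {
    // Knows only the base class; the bytes are the subclass's business.
    static byte[] encode(Posting posting) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            posting.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void decode(Posting posting, byte[] data) {
        try {
            posting.read(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Swapping in a different Posting subclass changes the bytes on disk
without touching encode() or decode().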

> Then there would also be plugins that just tap into the entire
> document (don't need inversion), like FieldsWriter.


Yes.  Here's how things are set up in KS:

    InvIndexer
       SegWriter
          DocWriter
          PostingsWriter
             LexWriter
          TermVectorsWriter
          // plug in more writers here?

Ideally, all of the writers under SegWriter would be subclasses of an  
abstract SegDataWriter class, and would implement addInversion() and  
addSegment(). SegWriter.addDoc() would look something like this:

    addDoc(Document doc) {
       Inversion inversion = invert(doc);
       for (int i = 0; i < writers.length; i++) {
          writers[i].addInversion(inversion);
       }
    }

In practice, three of the writers are required (one for term  
dictionary/lexicon, one for postings, and one for some form of  
document storage), but the design allows for plugging in additional  
SegDataWriter subclasses.
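
A minimal, runnable sketch of that arrangement in Java follows.  The
Inversion here is reduced to a list of whitespace-split tokens, the
register() method and CountingWriter are made up for the example, and
addSegment() is left out, so treat it as an illustration of the
dispatch loop only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; Inversion is reduced to a token list here.
final class Inversion {
    final List<String> tokens;

    Inversion(List<String> tokens) {
        this.tokens = tokens;
    }
}

abstract class SegDataWriter {
    abstract void addInversion(Inversion inversion);
    // addSegment() is omitted from this sketch.
}

/** Made-up plug-in writer that only tallies what it is handed. */
final class CountingWriter extends SegDataWriter {
    int docs;
    int tokens;

    @Override
    void addInversion(Inversion inversion) {
        docs++;
        tokens += inversion.tokens.size();
    }
}

final class SegWriter {
    private final List<SegDataWriter> writers = new ArrayList<>();

    void register(SegDataWriter writer) {
        writers.add(writer);
    }

    void addDoc(String doc) {
        // Invert once, then hand the same inversion to every writer.
        Inversion inversion = invert(doc);
        for (SegDataWriter writer : writers) {
            writer.addInversion(inversion);
        }
    }

    private static Inversion invert(String doc) {
        String trimmed = doc.trim();
        List<String> tokens = trimmed.isEmpty()
            ? Collections.<String>emptyList()
            : Arrays.asList(trimmed.split("\\s+"));
        return new Inversion(tokens);
    }
}
```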

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

