lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Per-document Payloads
Date Sun, 21 Oct 2007 00:20:30 GMT

On Oct 19, 2007, at 3:53 PM, Michael Busch wrote:

> The next question would be how to store the per-doc payloads (PDP). If
> all values have the same length (as the unique docIds), then we should
> store them as efficiently as possible, like the norms. However, we still
> want to offer the flexibility of having variable-length values. For this
> case we could use a new data structure similar to our posting list.
>
> PDPList               --> FixedLengthPDPList | <VariableLengthPDPList, SkipList>
> FixedLengthPDPList    --> <Payload>^SegSize
> VariableLengthPDPList --> <DocDelta, PayloadLength?, Payload>
> Payload               --> Byte^PayloadLength
> PayloadLength         --> VInt
> SkipList              --> see frq.file

There's another approach, which has the following advantages:

   * Simpler.
   * Pluggable.
   * More future-proof.
   * More closely models IR Theory.
   * Easier for other implementations to deal with.
   * Breaks the tight binding between Lucene and its file format.

Start with a Posting base class.

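   import java.io.IOException;
   import org.apache.lucene.store.IndexInput;
   import org.apache.lucene.store.IndexOutput;

   public class Posting {
     private int docNum = 0;      // current doc number, rebuilt from deltas
     private int lastDocNum = 0;  // doc number most recently written

     public int getDocNum() { return docNum; }

     // Doc numbers are delta-encoded: add the stored gap to the running value.
     public void read(IndexInput inStream) throws IOException {
       docNum += inStream.readVInt();
     }

     // Write the gap since the previous posting, then remember this doc number.
     public void write(IndexOutput outStream) throws IOException {
       outStream.writeVInt(docNum - lastDocNum);
       lastDocNum = docNum;
     }
   }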

Then, PostingList (subclassed by SegPostingList and MultiPostingList,  
naturally).

   import java.io.IOException;

   public abstract class PostingList {
      public abstract Posting getPosting();
      public abstract boolean next() throws IOException;
      public abstract boolean skipTo(int target) throws IOException;
   }
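
To make the iteration contract concrete, here is a minimal standalone sketch.
ArrayPostingList is a hypothetical in-memory stand-in for something like
SegPostingList, and the linear-scan skipTo() is an assumption (the outline
above does not say how skipTo is implemented). The Posting here is reduced
to a bare doc number so the example runs without Lucene.

```java
import java.io.IOException;

public class PostingListSketch {

    // Reduced Posting: just a doc number, no serialization.
    static class Posting {
        int docNum;
        int getDocNum() { return docNum; }
    }

    static abstract class PostingList {
        public abstract Posting getPosting();
        public abstract boolean next() throws IOException;

        // Assumed fallback: advance at least once, then keep going
        // until the current doc number is >= target.
        public boolean skipTo(int target) throws IOException {
            while (next()) {
                if (getPosting().getDocNum() >= target) return true;
            }
            return false;
        }
    }

    // Hypothetical in-memory list over sorted doc numbers.
    static class ArrayPostingList extends PostingList {
        private final int[] docNums;
        private final Posting posting = new Posting();
        private int pos = -1;

        ArrayPostingList(int[] docNums) { this.docNums = docNums; }

        @Override
        public Posting getPosting() { return posting; }

        @Override
        public boolean next() {
            if (++pos >= docNums.length) return false;
            posting.docNum = docNums[pos];
            return true;
        }
    }

    // Returns the first doc number >= target, or -1 if the list is exhausted.
    public static int firstAtOrAfter(int[] docNums, int target) throws IOException {
        PostingList list = new ArrayPostingList(docNums);
        return list.skipTo(target) ? list.getPosting().getDocNum() : -1;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(firstAtOrAfter(new int[] {2, 5, 9}, 4));   // 5
        System.out.println(firstAtOrAfter(new int[] {2, 5, 9}, 10));  // -1
    }
}
```

A real segment-backed list would presumably use skip data rather than a
linear scan; the scan is only the simplest thing that honors the contract.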

Each field gets its own "postings" file within the segment, named  
_SEGNUM_FIELDNUM.p, where SEGNUM and FIELDNUM are encoded using base  
36.  Each of these files is a solid stack of serialized Postings.
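
As a small illustration of the naming scheme, the sketch below builds the
file name from a segment number and a field number. The helper name
postingsFileName is made up for this example; only the _SEGNUM_FIELDNUM.p
pattern and the base 36 encoding come from the description above.

```java
public class PostingsFileName {

    // Hypothetical helper: "_SEGNUM_FIELDNUM.p", both numbers in base 36.
    public static String postingsFileName(int segNum, int fieldNum) {
        return "_" + Integer.toString(segNum, Character.MAX_RADIX)
             + "_" + Integer.toString(fieldNum, Character.MAX_RADIX)
             + ".p";
    }

    public static void main(String[] args) {
        System.out.println(postingsFileName(35, 2));   // _z_2.p
        System.out.println(postingsFileName(36, 10));  // _10_a.p
    }
}
```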

Posting subclasses like ScorePosting, PayloadPosting, etc., implement
their own read() and write() methods.  Thus, Posting subclasses
wholly define their own file format -- instead of the current,
brittle design, where read/write code is dispersed over multiple
classes.  If some Posting types become obsolete, they can be
deprecated, but PostingList and its subclasses won't require the
addition of crufty special-case code to stay back-compatible.
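
A rough sketch of what such a subclass might look like follows. It is not
the working implementation mentioned in this message. To stay runnable
without Lucene it substitutes java.io data streams for
IndexInput/IndexOutput and fixed-width ints for VInts; the setters and the
length-prefixed payload layout are likewise assumptions made for the
example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class PayloadPostingSketch {

    // Stand-in base class: java.io streams replace IndexInput/IndexOutput,
    // and a fixed-width int replaces the VInt, to keep the sketch standalone.
    static class Posting {
        protected int docNum = 0;
        protected int lastDocNum = 0;

        public int getDocNum() { return docNum; }
        public void setDocNum(int docNum) { this.docNum = docNum; }

        public void read(DataInputStream in) throws IOException {
            docNum += in.readInt();             // delta-decode
        }

        public void write(DataOutputStream out) throws IOException {
            out.writeInt(docNum - lastDocNum);  // delta-encode
            lastDocNum = docNum;
        }
    }

    // Hypothetical subclass: a length-prefixed payload follows the doc delta.
    static class PayloadPosting extends Posting {
        private byte[] payload = new byte[0];

        public byte[] getPayload() { return payload; }
        public void setPayload(byte[] payload) { this.payload = payload; }

        @Override
        public void read(DataInputStream in) throws IOException {
            super.read(in);
            payload = new byte[in.readInt()];
            in.readFully(payload);
        }

        @Override
        public void write(DataOutputStream out) throws IOException {
            super.write(out);
            out.writeInt(payload.length);
            out.write(payload);
        }
    }

    // Writes two postings, reads them back, returns "docNum:payloadLength" pairs.
    public static String roundTrip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        PayloadPosting writer = new PayloadPosting();
        writer.setDocNum(3);
        writer.setPayload(new byte[] {7});
        writer.write(out);
        writer.setDocNum(10);
        writer.setPayload(new byte[] {8, 9});
        writer.write(out);

        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        PayloadPosting reader = new PayloadPosting();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2; i++) {
            reader.read(in);
            sb.append(reader.getDocNum()).append(':')
              .append(reader.getPayload().length).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip());  // 3:1 10:2
    }
}
```

The point of the sketch is that the stream layout lives entirely in the
two overridden methods, so nothing outside PayloadPosting needs to know
how a payload is stored.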

There's more (I've written a working implementation), but that's the  
gist.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
