lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject Re: Pooling of posting objects in DocumentsWriter
Date Fri, 11 Apr 2008 00:10:13 GMT

On Apr 10, 2008, at 2:37 AM, Michael McCandless wrote:
>> IMO, the abstract base Posting class should not track text.  It  
>> should
>> include only one datum: a document number.  This keeps it in line  
>> with the
>> simplest IR definition for a "posting": one document matching one  
>> term.
> But how do you then write out a segment with the terms packed, in
> sorted order?  Your "generic" layer needs to know how to sort these
> Posting lists by term text, right?

Yes.  But the generic layer is the semi-serialized export form  
RawPosting, not the abstract base class Posting.  (Would it be clearer  
if RawPosting was named "FlatPosting"?)

   struct kino_Posting {
       kino_VirtualTable* _;
       kino_ref_t ref;
       chy_u32_t doc_num;

   struct kino_RawPosting {
       kino_VirtualTable* _;
       kino_ref_t ref;
       chy_u32_t doc_num;
       chy_u32_t freq;
       chy_u32_t content_len;
       chy_u32_t aux_len;
       char blob[1];    /* flexible array */

The last member of RawPosting, "blob", uses the C "flexible array"  
hack which takes advantage of C's lack of bounds checking.  A  
RawPosting consists of a single contiguous memory allocation, but the  
size of that allocation varies depending on how much information the  
object needs to carry.

The "blob" is divided into two parts: content, and "aux".  Content is  
term text.  "aux" is encoded material which will be written to the  
postings file.  The total allocation for any given RawPosting can be  
calculated like so:

    offsetof(RawPosting, blob) + posting->content_len + posting->aux_len

Sorting is done by comparing first the "content" component of blob  
using memcmp(), then breaking ties with content_len and finally,  

The write method invoked by PostingsWriter is the same --  
RawPost_Write_Record -- no matter what Posting subclass is assigned to  
the field.  The first part of the implementing function  
RawPost_write_record may look familiar to you, as it uses a close  
variant of Lucene's algo for encoding doc_num and freq together.  Note  
the last line, though:

     RawPost_write_record(RawPosting *self, OutStream *outstream,
                          u32_t last_doc_num)
         const u32_t delta_doc = self->doc_num - last_doc_num;
         char  *const aux_content = self->blob + self->content_len;

         /* Write doc num, freq. */
         if (self->freq == 1) {
             const u32_t doc_code = (delta_doc << 1) | 1;
             OutStream_Write_C32(outstream, doc_code);
         else {
             const u32_t doc_code = delta_doc << 1;
             OutStream_Write_C32(outstream, doc_code);
             OutStream_Write_C32(outstream, self->freq);

         /* Print prepared data. */
         OutStream_Write_Bytes(outstream, aux_content, self->aux_len);

The idea is that information other than doc num and freq gets  
serialized during the export operation which creates the RawPosting --  
then the last line of RawPost_write_record prints the pre-serialized  
data to the OutStream.

> OK I now see that what we call Posting really should be called
> PostingList: each instance of this class, in DW, tracks all documents
> that contained that term.

I suppose that would be more accurate... though in KS, the  
"PostingList" class is the replacement for TermDocs, TermPositions, etc.

> Whereas for KS, Posting is a single occurrence of term in a single  
> doc, right?

It's one term matching one document.

These are the definitions I'm using, as set out in  
KinoSearch::Docs::IRTheory at <>:

     * document - An atomic unit of retrieval.
     * term - An attribute which describes a document.
     * posting - One term indexing one document.
     * term list - The complete list of terms which describe a document.
     * posting list - The complete list of documents which a term  
     * inverted index - A data structure which maps from terms to  

> Does a Posting contain all
> occurrences of the term in the doc (multiple positions) or only one?

In KS, a Posting object can represent multiple occurrences of a term  
within a single document.  That was somewhat arbitrary on my part.

> How do you do buffering/flushing?

RawPostings are fed into an object which handles external sorting.   
The external sorter tracks RAM usage, and checks once per document  
whether it should flush a sorted run to temporary storage.

After the last document gets added, the external sorter flips to  
fetching mode.  The pre-sorted runs are merged into a sorted stream of  
RawPostings, and the postings and lexicon files get written out.

>> Then, for search-time you have a PostingList class which takes the  
>> place of
>> TermDocs/TermPositions, and uses an underlying Posting object to  
>> read the
>> file.  (PostingList and its subclasses don't know anything about file
>> formats.)
> Wouldn't PostingList need to know something of the file format?  EG
> maybe it's a sparse format (docID or gap encoded each time), or, it's
> non-sparse (like norms, column-stride fields).

Each subclass of Posting implements its own Read_Record() method.   
PostingList just hands off the InStream to its internal Posting object  
-- it doesn't know or care how much data from the InStream the Posting  
consumes with each Read_Record().

>> As I have argued before, the key is to have each Posting subclass  
>> wholly
>> define a file format.  That makes them pluggable, breaking the  
>> tight binding
>> between the Lucene codebase and the Lucene file format spec.
> It's not just Posting that defines the file format.

I could have been clearer -- I meant a postings file format.

> I think TermVectorsWriter should be seen as a consumer of the
> "inversion" plugin API.  It's just that, unlike the frq/prx writer,
> which flushes when RAM is full, the TermVectorsWriter flushes after
> each doc.  Ie, the generic code does the inversion, feeding "you"
> Posting occurrences, and "you" write this to a file however you want.

Yes, that's how it works in KS, too.  :)

I haven't written a formal Inversion class yet, though.  Upon  
reflection, perhaps it would be better to have "Inversion" be specific  
to a single field, and "InvertedDocument" collect everything together?

    public void addDocument(Document document) {
       InvertedDocument inverteDoc = invert(document)
       for (i = 0; i < writers.length; i++) {

I think that's better in KS, because I can rename "TokenBatch" -- the  
class that actually does the low-level inverting and that subclasses  
of Analyzer need to deal with -- to "Inversion".

Marvin Humphrey
Rectangular Research

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message