lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject Re: Pooling of posting objects in DocumentsWriter
Date Sat, 12 Apr 2008 22:04:35 GMT

On Apr 12, 2008, at 2:29 AM, Michael McCandless wrote:

>> The total allocation for any given RawPosting can be calculated  
>> like so:
>>   offsetof(RawPosting, blob) + posting->content_len + posting- 
>> >aux_len
> Probably you also need to add in bytes lost to alignment requirements,
> when content_len + aux_len doesn't "align" to a 4/8 byte boundary?

Good thought, but "aux" isn't malloc'd and never gets dereferenced,  
just copied.  It's not even a real member -- it's just an artificially  
bounded slice of bytes within the single RawPosting allocation.

In fact, RawPosting "objects" aren't even malloc'd individually --  
they're assigned memory chunks by a custom MemoryPool object (which  
*does* have to maintain awareness of alignment issues).  You can't do  


Instead, you do this, which frees all the memory previously doled out  
by the MemoryPool, in a single action which is much faster than lots  
of individual calls to free():

    MemPool_Release_All(mem_pool);  /* mass liberation of RawPostings */

> It sounds like external sorter deals with RawPosting and need not know
> any details of the aux bytes being carried along.


> But: why distinguish between Posting and RawPosting?

RawPosting can't be subclassed, because of that flexible array  
member.  Flexible arrays, which are a lesser-known but reliable hack,  
have to be the last element in a C struct:

I wish RawPosting's design was a little less cryptic.  However,  
malloc() and free(), used by the normal object allocation  
infrastructure in KS, are comparatively expensive.  Using a custom  
allocation strategy speeds things up, at the cost of some code clarity  
-- essentially the same tradeoff that started this thread.

> Who uses Posting but not RawPosting?

Posting is an abstract base class.  MatchPosting, ScorePosting, etc.  
subclass it.

PostingList uses Posting directly.  Here's the function that  
implements SegPostingList's Next() method.

     SegPList_next(SegPostingList *self)
         InStream *const post_stream = self->post_stream;
         DelDocs  *const deldocs     = self->deldocs;
         Posting  *const posting     = self->posting;

         while (1) {
             /* Bail if we're out of docs. */
             if (self->count >= self->doc_freq) {
                 Post_Reset(posting, self->doc_base);
                 return 0;

             Post_Read_Record(posting, post_stream);

             /* If the doc isn't deleted... success! */
             if (deldocs == NULL) {
             else {
                 u32_t doc_minus_base = posting->doc_num - self- 
                 if ( !DelDocs_Get(deldocs, doc_minus_base) ) {

         return posting->doc_num;

> Or, maybe it's because you do two passes when indexing a document?
> First "invert" it, into array of Posting instances,

KS deals with Tokens grouped within arrays, rather than streams of  
Tokens a la Lucene.  Inverting is simply a matter of sorting the  
array, then grouping Tokens by common term text.  One RawPosting gets  
created for each unique term.

>>> OK I now see that what we call Posting really should be called
>>> PostingList: each instance of this class, in DW, tracks all  
>>> documents
>>> that contained that term.
>> I suppose that would be more accurate... though in KS, the  
>> "PostingList"
>> class is the replacement for TermDocs, TermPositions, etc.
> I see.  Though that is really a search-time vs indexing-time
> distinction, referencing the same thing (a "posting list").  Maybe
> RawPostingList, to make it clearer that these are the raw bytes for
> many docs produced during indexing?

That usage would fit within the current KS nomenclature, for whatever  
that's worth WRT Lucene. ;)

> In Lucene, a Posting instance holds many documents.

I think renaming that class might improve accuracy a bit (I assume  
it's not public).  I've never seen "posting" defined as something  
which could hold multiple documents -- it's either one term indexing  
one document (the definition KS uses), or one instance of one term  
indexing one document.

The latter meaning seems to be a little more common, but it didn't map  
as well to the class hierarchy I wanted to define.

> In Lucene we just flush a true segment when RAM is full.  For
> early versions of LUCENE-843 I was flushing "intermediate" segments
> (like KS's runs), but the cost of merging these runs wasn't that much
> better than the cost of merging normal segments, and, it added
> unexpected cost to close(), so I fell back to just flushing normal
> segments.

Really, you gave up on that?  Then aren't you moving stored fields and  
term vectors around more often?

> ow do multiple threads interact here?  Does each one work with its
> own ExternalSorter, and then all threads and all of their runs do
> merge sort at the end?  Or do multiple threads feed a single external
> sorter?  How finely do you require thread synchronization?

KS doesn't presently work with threads -- Perl's threading model is  
kind of awkward.  But here's the way it would probably work best:  One  
external sorter object (a PostingPoolQueue).  One sorted run object (a  
PostingPool) per thread.  Thread synchronization is required whenever  
a PostingPool gets written to temp storage, since all pools share a  
common OutStream.  Synchronization is also required at finish-time  
during periodic calls to PostPoolQ_Refill(pool_q), which is when  
RawPosting objects are merge sorted from the PostingPools into a cache  
in the PostingPoolQueue.

The PostingPool objects, each of which represents a sorted run,  
actually do most of the work.

Marvin Humphrey
Rectangular Research

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message