lucene-dev mailing list archives

From "Michael McCandless" <>
Subject Re: Pooling of posting objects in DocumentsWriter
Date Sun, 13 Apr 2008 09:28:36 GMT
Marvin Humphrey <> wrote:
>  On Apr 12, 2008, at 2:29 AM, Michael McCandless wrote:
> >
> > > The total allocation for any given RawPosting can be calculated like so:
> > >
> > >  offsetof(RawPosting, blob) + posting->content_len + posting->aux_len
> > >
> >
> > Probably you also need to add in bytes lost to alignment requirements,
> > when content_len + aux_len doesn't "align" to a 4/8 byte boundary?
> >
> Good thought, but "aux" isn't malloc'd and never gets dereferenced, just
> copied.  It's not even a real member -- it's just an artificially bounded
> slice of bytes within the single RawPosting allocation.
> In fact, RawPosting "objects" aren't even malloc'd individually -- they're
> assigned memory chunks by a custom MemoryPool object (which *does* have to
> maintain awareness of alignment issues).  You can't do this:
>    free(raw_posting);
> Instead, you do this, which frees all the memory previously doled out by
> the MemoryPool, in a single action which is much faster than lots of
> individual calls to free():
>    MemPool_Release_All(mem_pool);  /* mass liberation of RawPostings */

Right, it's that alignment waste that I'd think you need to "amortize"
onto the cost of each RawPosting.

OK got it.  Pools are nice when they "match" :)
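[For illustration, here is a minimal bump-pointer pool along the lines described above.  This is a sketch, not KS's actual MemoryPool; the names MemPool, MemPool_Grab, and the fixed-size buffer are assumptions for this example.]

```c
#include <stddef.h>

/* Sketch of a bump-pointer memory pool.  KS's real MemoryPool is
 * more elaborate (e.g. it can grow); this only shows the idea. */
typedef struct {
    char  *buf;    /* backing storage */
    size_t cap;    /* total capacity in bytes */
    size_t used;   /* bytes doled out so far */
} MemPool;

static void*
MemPool_Grab(MemPool *pool, size_t size)
{
    void *slice;
    /* Round the request up to pointer alignment -- this rounding is
     * the per-object "alignment waste" discussed above. */
    size = (size + sizeof(void*) - 1) & ~(sizeof(void*) - 1);
    if (pool->used + size > pool->cap) return NULL;
    slice = pool->buf + pool->used;
    pool->used += size;
    return slice;
}

static void
MemPool_Release_All(MemPool *pool)
{
    pool->used = 0;   /* one action invalidates every grabbed slice */
}
```

Nothing grabbed from the pool is ever freed individually; resetting `used` reclaims everything at once, which is why the mass release is so much cheaper than per-object free() calls.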

> > But: why distinguish between Posting and RawPosting?
> >
> RawPosting can't be subclassed, because of that flexible array member.
> Flexible arrays, which are a lesser-known but reliable hack, have to be the
> last element in a C struct:
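[For readers unfamiliar with the idiom, here is a sketch of such a struct, using the field names mentioned earlier in this thread (doc_num, content_len, aux_len, blob).  The actual KS definition may differ, and KS grabs these from a MemoryPool rather than calling malloc per object; malloc is used here only to keep the example self-contained.]

```c
#include <stdlib.h>
#include <stddef.h>
#include <string.h>

typedef unsigned int u32_t;

/* Hypothetical RawPosting-like struct.  The flexible array member
 * must be the last member of the struct. */
typedef struct RawPosting {
    u32_t doc_num;
    u32_t content_len;
    u32_t aux_len;
    char  blob[];   /* term text bytes, then the "aux" byte slice */
} RawPosting;

/* One allocation covers the fixed header plus both variable-length
 * slices -- matching the size formula quoted above:
 * offsetof(RawPosting, blob) + content_len + aux_len. */
static RawPosting*
RawPost_new(u32_t doc_num, const char *content, u32_t content_len,
            const char *aux, u32_t aux_len)
{
    RawPosting *self
        = malloc(offsetof(RawPosting, blob) + content_len + aux_len);
    self->doc_num     = doc_num;
    self->content_len = content_len;
    self->aux_len     = aux_len;
    memcpy(self->blob, content, content_len);
    memcpy(self->blob + content_len, aux, aux_len);
    return self;
}
```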

Alas we can't use such neat tricks in Javaland...

> I wish RawPosting's design was a little less cryptic.  However, malloc()
> and free(), used by the normal object allocation infrastructure in KS, are
> comparatively expensive.  Using a custom allocation strategy speeds things
> up, at the cost of some code clarity -- essentially the same tradeoff that
> started this thread.

Makes sense.

> > Who uses Posting but not RawPosting?
> >
> Posting is an abstract base class.  MatchPosting, ScorePosting, etc.
> subclass it.
> PostingList uses Posting directly.  Here's the function that implements
> SegPostingList's Next() method.
>     u32_t
>     SegPList_next(SegPostingList *self)
>     {
>         InStream *const post_stream = self->post_stream;
>         DelDocs  *const deldocs     = self->deldocs;
>         Posting  *const posting     = self->posting;
>         while (1) {
>             /* Bail if we're out of docs. */
>             if (self->count >= self->doc_freq) {
>                 Post_Reset(posting, self->doc_base);
>                 return 0;
>             }
>             self->count++;
>             Post_Read_Record(posting, post_stream);
>             /* If the doc isn't deleted... success! */
>             if (deldocs == NULL) {
>                 break;
>             }
>             else {
>                 u32_t doc_minus_base = posting->doc_num - self->doc_base;
>                 if ( !DelDocs_Get(deldocs, doc_minus_base) ) {
>                     break;
>                 }
>             }
>         }
>         return posting->doc_num;
>     }

I see.  So Posting exposes the doc_num so the generic code
(SegPList_next) can check if the doc is deleted when iterating.

What is self->doc_base?  In Lucene, each segment doesn't "know" its
doc base; MultiSegmentReader handles that "above" this layer.

> > Or, maybe it's because you do two passes when indexing a document?
> > First "invert" it, into array of Posting instances,
> >
> KS deals with Tokens grouped within arrays, rather than streams of Tokens a
> la Lucene.  Inverting is simply a matter of sorting the array, then grouping
> Tokens by common term text.  One RawPosting gets created for each unique
> term.

OK.  So, first a Document is translated into an array of Tokens (one per
term occurrence).  Then this array gets sorted by term text, then term
position.  Then (2nd pass) you coalesce all adjacent Tokens that share
the same term text and write each group into a RawPosting.  Then this
array of RawPostings is passed to the external sorter.
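[The sort-then-coalesce inversion described above can be sketched like so.  The Token struct and function names here are illustrative assumptions, not KS's actual code, and the "write a RawPosting per run" step is reduced to counting the runs.]

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical Token -- a sketch, not KS's actual struct. */
typedef struct { const char *text; int pos; } Token;

/* Sort by term text, then by term position. */
static int
Token_compare(const void *va, const void *vb)
{
    const Token *a = va;
    const Token *b = vb;
    int comparison = strcmp(a->text, b->text);
    return comparison ? comparison : a->pos - b->pos;
}

/* "Invert": sort the array, then coalesce adjacent Tokens that share
 * term text.  Each run would become one RawPosting; here we just
 * count the unique terms. */
static size_t
count_unique_terms(Token *tokens, size_t num_tokens)
{
    size_t num_terms = 0;
    size_t i = 0;
    qsort(tokens, num_tokens, sizeof(Token), Token_compare);
    while (i < num_tokens) {
        size_t j = i;
        while (j < num_tokens
               && strcmp(tokens[j].text, tokens[i].text) == 0) {
            j++;
        }
        num_terms++;   /* one unique term == one RawPosting */
        i = j;
    }
    return num_terms;
}
```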

In Lucene we take each token and do a hash lookup and then write the
necessary bytes immediately.  We then only sort, once, during flush.
(Except, term vectors still must sort every document's Tokens.)

> > > > OK I now see that what we call Posting really should be called
> > > > PostingList: each instance of this class, in DW, tracks all documents
> > > > that contained that term.
> > > >
> > > >
> > >
> > > I suppose that would be more accurate... though in KS, the "PostingList"
> > > class is the replacement for TermDocs, TermPositions, etc.
> > >
> >
> > I see.  Though that is really a search-time vs indexing-time
> > distinction, referencing the same thing (a "posting list").  Maybe
> > RawPostingList, to make it clearer that these are the raw bytes for
> > many docs produced during indexing?
> >
> That usage would fit within the current KS nomenclature, for whatever
> that's worth WRT Lucene. ;)
> > In Lucene, a Posting instance holds many documents.
> >
> I think renaming that class might improve accuracy a bit (I assume it's not
> public).  I've never seen "posting" defined as something which could hold
> multiple documents -- it's either one term indexing one document (the
> definition KS uses), or one instance of one term indexing one document.
> The latter meaning seems to be a little more common, but it didn't map as
> well to the class hierarchy I wanted to define.


> > In Lucene we just flush a true segment when RAM is full.  For
> > early versions of LUCENE-843 I was flushing "intermediate" segments
> > (like KS's runs), but the cost of merging these runs wasn't that much
> > better than the cost of merging normal segments, and, it added
> > unexpected cost to close(), so I fell back to just flushing normal
> > segments.
> >
> Really, you gave up on that?  Then aren't you moving stored fields and term
> vectors around more often?

Actually, no, we aren't moving those bytes around, when
autoCommit=false (which will be the only allowed value in 3.0:
autoCommit=true is now deprecated in trunk).

Instead, segments are now allowed to share these "doc stores".  Each
segment can reference another segment's doc stores, starting at its
own offset.
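[Shared doc stores can be pictured roughly like this.  The struct and field names below are illustrative, not Lucene's actual SegmentInfo code.]

```c
/* Sketch: a segment that borrows another segment's doc-store files,
 * beginning at its own offset within them. */
typedef struct {
    const char *doc_store_segment;  /* segment that owns the files */
    long        doc_store_offset;   /* this segment's first doc there */
    long        doc_count;          /* docs in this segment */
} SegmentInfo;

/* Doc n of this segment lives at position (doc_store_offset + n)
 * in the shared stored-fields / term-vectors files, so flushing a
 * new segment never has to copy those bytes. */
static long
doc_store_position(const SegmentInfo *si, long n)
{
    return si->doc_store_offset + n;
}
```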

> > How do multiple threads interact here?  Does each one work with its
> > own ExternalSorter, and then all threads and all of their runs do
> > merge sort at the end?  Or do multiple threads feed a single external
> > sorter?  How finely do you require thread synchronization?
> >
> KS doesn't presently work with threads -- Perl's threading model is kind of
> awkward.

Oooh, ouch: the world is quickly going multi-core/cpu.  I think thread
concurrency is very important.  But this sounds like a Perl threading
limitation, imposed on you.

> But here's the way it would probably work best:  One external
> sorter object (a PostingPoolQueue).  One sorted run object (a PostingPool)
> per thread.  Thread synchronization is required whenever a PostingPool gets
> written to temp storage, since all pools share a common OutStream.
> Synchronization is also required at finish-time during periodic calls to
> PostPoolQ_Refill(pool_q), which is when RawPosting objects are merge sorted
> from the PostingPools into a cache in the PostingPoolQueue.

OK this sounds like it'd have good concurrency.  Each thread would
accumulate many docs worth of RawPostings arrays.  Though, holding the
global lock while doing the merge sort into the PostingPoolQueue's
cache might become a bottleneck?  How often does that happen, and what
net percentage of the time does it normally take?

In Lucene, each thread writes into its own private RawPosting
equivalent (stored in a thread state class), until it's time to flush
a segment, at which point we do a merge sort of all thread states
while holding the global lock.  The only other time we hold the global
lock is to write to the doc store files since those writes must be
done in docID order.
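[The flush-time step -- merge sorting the per-thread sorted runs into one stream -- looks roughly like this simple k-way merge over int keys.  This is a sketch of the technique only, not Lucene's or KS's actual code.]

```c
#include <stddef.h>

/* Merge up to 16 already-sorted runs into out[]; returns the total
 * number of elements written.  A real implementation would use a
 * heap of run heads instead of a linear scan. */
static size_t
merge_runs(const int **runs, const size_t *lens, size_t k, int *out)
{
    size_t idx[16] = {0};   /* next unread position in each run */
    size_t total = 0;
    for (;;) {
        int best = -1;
        for (size_t i = 0; i < k; i++) {
            if (idx[i] < lens[i]
                && (best < 0 || runs[i][idx[i]] < runs[best][idx[best]])) {
                best = (int)i;
            }
        }
        if (best < 0) break;             /* every run is exhausted */
        out[total++] = runs[best][idx[best]++];
    }
    return total;
}
```

Each per-thread run is already sorted, so the merge only ever compares the current head of each run -- which is why the global lock is needed just for this final pass and not while the threads accumulate their runs.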

We actually lose some concurrency here: if a large doc is being
indexed by one thread, and a tiny doc by another thread, that tiny doc
can't write until the large doc finishes, and this ties up the tiny
doc's thread state such that if that thread tries to index another doc
it will have to wait until the previous doc gets written.  The right
fix here is to make a separate class to hold the bundle of bytes that
need to be flushed.  There's no reason why the entire thread state
must be held up.

