From: Marvin Humphrey
To: java-dev@lucene.apache.org
Subject: Re: Pooling of posting objects in DocumentsWriter
Date: Tue, 15 Apr 2008 14:57:16 -0700

On Apr 13, 2008, at 2:28 AM, Michael McCandless wrote:

>> In fact, RawPosting "objects" aren't even malloc'd individually --
>> they're assigned memory chunks by a custom MemoryPool object (which
>> *does* have to maintain awareness of alignment issues). You can't do
>> this:
>
> Right, it's that alignment waste that I'd think you need to "amortize"
> onto the cost of each RawPosting.

Worse than the alignment waste is that it's impossible to know just how
much memory each RawPosting requires until the export operation
completes, but you have to allocate it up front. That means planning
for a worst-case scenario. It's analogous to packing UTF-8 from UTF-16
source.

The way I hacked it was to always ask for the worst-case amount when
allocating each RawPosting, but allow quick resizing via an immediate
call to MemPool_Resize() -- see the sketch just below. I'm perfectly
comfortable with such hacks when they're contained to a small scope,
but this one's not: each Posting subclass has to deal with the
MemPool_Resize API or waste memory -- theoretically including public
subclasses. :(
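(For illustration, the worst-case-then-resize pattern looks roughly
like this. MemPool_Grab and RawPost_encode are stand-in names and the
sizing math is hypothetical -- a sketch of the idea, not the actual KS
code:)

    /* Worst case: one UTF-16 code unit never expands to more than 3
     * bytes of UTF-8, so ask for 3 bytes per code unit up front. */
    size_t      worst_case = sizeof(RawPosting) + num_code_units * 3;
    RawPosting *raw_post   = MemPool_Grab(mem_pool, worst_case);

    /* Fill in the posting, discovering the true size as we go. */
    size_t actual_size = RawPost_encode(raw_post, source_text,
                                        num_code_units);

    /* Hand the unused tail back to the pool right away.  This is cheap
     * because raw_post was the pool's most recent allocation. */
    MemPool_Resize(mem_pool, raw_post, actual_size);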
But perhaps we can do better...

>> Flexible arrays, which are a lesser-known but reliable hack, have to
>> be the last element in a C struct:
>>
>> http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
>
> Alas we can't use such neat tricks in Javaland...

Heh. It works, but it would be better if it weren't there. And I think
there is a solution which will be elegant both in Java and in KS.

Let's consider a new private class, FlatPosting, intended to take the
place of RawPosting:

    struct kino_FlatPosting {
        kino_VirtualTable *_;
        kino_ref_t         ref;
        chy_u32_t          doc_num;
        char              *content;
        size_t             content_len;
        char              *blob;
        size_t             blob_len;
    };

FlatPosting's "content" and "blob" members will be pointers into shared
buffers. In Java, they would be array offsets instead. Unlike
RawPosting objects, each FlatPosting is the same size -- no flexible
array members.

FlatPostings will be pooled within PostingPool parent objects, which
will own the shared buffers. Because they are pooled and reused,
allocation overhead is no longer crucial, and FlatPostings can be
ordinary objects. When the shared buffer gets full, the PostingPool
will flush a sorted run to disk.

Only PostingPool, PostingPoolQueue (the external sorter), and
PostingsWriter will have access to the FlatPostings -- it's best to
limit FlatPosting's visibility because managing access to the shared
buffer is tricky.

PostingPool will be public. Subclasses of Posting will contribute to
the PostingPool via a PostPool_Add() method:

    void
    PostPool_Add(PostingPool *self, u32_t doc_num,
                 char *content, size_t content_len,
                 char *blob, size_t blob_len);

Hmm, we'll need some way to track field as well. There are a couple of
ways of doing that...

>> I wish RawPosting's design was a little less cryptic. However,
>> malloc() and free(), used by the normal object allocation
>> infrastructure in KS, are comparatively expensive. Using a custom
>> allocation strategy speeds things up, at the cost of some code
>> clarity -- essentially the same tradeoff that started this thread.
>
> Makes sense.

Well, here's the thing. RawPosting works in KS and it's fast, but it's
a hack -- good enough for an implementation detail, but not good enough
to expose as a public API. And like Lucene's DocumentsWriter,
KinoSearch's PostingsWriter is large and difficult to penetrate. What
I'm hoping is that this discussion will yield some innovations that
allow me to refactor this code for clarity and interface simplicity.
There's still going to be a lot of code, but hopefully it can be broken
up into more coherent modules. I particularly want Posting to be as
small as possible, because that's the one I'd like to be publicly
subclassable.

FWIW, refactoring the external sorter has been pretty successful:

    KS::Util::SortExternal
    KS::Util::SortExRun
    KS::Test::Util::BBSortEx      extends SortExternal
    KS::Test::Util::BBSortExRun   extends SortExRun
    KS::Index::PostingPoolQueue   extends SortExternal
    KS::Index::PostingPool        extends SortExRun

SortExternal and SortExRun are abstract base classes. BBSortEx and
BBSortExRun, which deal with ByteBufs, are test-only subclasses that
exercise the merging code. PostingPoolQueue and PostingPool deal with
RawPosting objects.

PostingPoolQueue is nice and small now, and SortExternal and SortExRun
are pretty coherent and well tested. The only one left in that
hierarchy that's kind of bloated and fragile is PostingPool.

> I see. So Posting exposes the doc_num so the generic code
> (SegPList_next) can check if the doc is deleted when iterating.

Yes. It would be nice to make it an accessor, but the PList_Next()
method is inner loop code.
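(For illustration, the loop looks something like the sketch below. The
member and function names -- post_stream, deldocs, Post_Read_Record,
BitVec_Get -- are assumptions for the example's sake, not necessarily
the real KS internals:)

    /* Return the next Posting for a live (non-deleted) document, or
     * NULL when the segment's postings are exhausted. */
    kino_Posting*
    SegPList_next(kino_SegPostingList *self) {
        while (Post_Read_Record(self->posting, self->post_stream)) {
            /* Direct struct access instead of an accessor call keeps
             * the deletion check cheap in this inner loop. */
            if (!BitVec_Get(self->deldocs, self->posting->doc_num)) {
                return self->posting;
            }
            /* Doc is deleted; skip it and keep reading. */
        }
        return NULL;
    }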
> What is self->doc_base? In Lucene, each segment doesn't "know" its
> doc base; MultiSegmentReader handles that "above" this layer.

Right, I had to move that down. The reason is that Posting objects get
released into the wild, and once that happens they can't ask their
parent container object about the offset any more.

With a TermDocs object, you can only access the doc number through the
TermDocs itself...

    while (termDocs.next()) {
        int docNum = termDocs.doc();
        ...
    }

... but with a PostingList, you can access the doc number through the
Posting:

    while (postingList.next()) {
        Posting posting = postingList.getPosting();
        int docNum = posting.getDocNum();
        ...
    }

> So, first a Document is translated into an array of Tokens (one per
> term occurrence). Then this array gets sorted by term text, then term
> position. Then (2nd pass) you coalesce all adjacent Tokens that share
> the same term text and write each into a RawPosting. Then this array
> of RawPostings is passed to the external sorter.

Yes. You know what I'd really like to do, though? Kill off Token and
replace it with Posting.

> In Lucene we take each token and do a hash lookup and then write the
> necessary bytes immediately. We then only sort, once, during flush.

I'm a little confused. Are you saying you have a trick where Lucene
sorts less often? What you refer to as a "second pass" in KS doesn't
involve sorting (but it looks like you grok that?). Also, KS doesn't
use hashing to group Tokens. We discussed this algo once before. Your
reaction:

    OOOOOH! I like that approach! So you basically do not "de-dup" by
    field+term on your first pass through the tokens in the doc (which
    is "roughly" what that hash does). Instead, append all tokens in an
    array, then sort first by field+text and second by position? This
    is done for each document right? This seems like it could be a
    major win! :)
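(For concreteness, a minimal sketch of that per-document inversion
sort. The Token layout and the comparator are illustrative assumptions,
not the actual KS code:)

    #include <stdlib.h>
    #include <string.h>

    typedef struct Token {
        int    field;      /* field number */
        char  *text;       /* term text */
        size_t text_len;
        int    position;   /* position within the field */
    } Token;

    /* qsort comparator: order by field+text first, position second, so
     * that all occurrences of a term land adjacent to one another and
     * the coalescing pass can write one posting per unique term. */
    static int
    compare_tokens(const void *va, const void *vb) {
        const Token *a = *(const Token *const *)va;
        const Token *b = *(const Token *const *)vb;
        if (a->field != b->field) { return a->field - b->field; }
        size_t min_len = a->text_len < b->text_len
                       ? a->text_len : b->text_len;
        int comp = memcmp(a->text, b->text, min_len);
        if (comp != 0) { return comp; }
        if (a->text_len != b->text_len) {
            return a->text_len < b->text_len ? -1 : 1;
        }
        return a->position - b->position;
    }

    /* Usage: given Token *tokens[num_tokens] accumulated for one doc...
     * qsort(tokens, num_tokens, sizeof(Token*), compare_tokens); */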
> (Except, term vectors still must sort every document's Tokens.)

In KS, TermVectorsWriter gets passed the already-inverted array of
Tokens. This object, currently named TokenBatch but perhaps to be
renamed Inversion, is what Analyzers operate on in KS.

> Instead, segments are now allowed to share these "doc stores". Each
> segment can reference another segment's doc stores, starting at its
> own offset.

Mmm, that's clever. :) I wish it didn't break the segmentation model,
but it's a sensible optimization -- especially because you write
segments so much more often.

>> KS doesn't presently work with threads -- Perl's threading model is
>> kind of awkward.
>
> Oooh, ouch: the world is quickly going multi-core/cpu. I think thread
> concurrency is very important. But this sounds like a Perl threading
> limitation, imposed on you.

Yeah. I'd say it's a higher priority to finish the C API and get POSIX
threads working than it is to get Perl threads working.

> OK this sounds like it'd have good concurrency. Each thread would
> accumulate many docs worth of RawPostings arrays. Though, holding the
> global lock while doing the merge sort into the PostingPoolQueue's
> cache might become a bottleneck? How often does that happen and what
> net percentage of the time does it normally take?

It happens at the end of an indexing session, after the last doc has
been processed, when $invindexer->finish gets called -- the analogue to
IndexWriter.close(). (Probably I should put it into
$invindexer->prepare_commit. ;) The amount of time it takes to finish a
segment varies a lot. I haven't benchmarked it rigorously, but my guess
is that it averages between 5 and 25 percent of total time.

> In Lucene, each thread writes into its own private RawPosting
> equivalent (stored in a thread state class), until it's time to flush
> a segment, at which point we do a merge sort of all thread states
> while holding the global lock.

Sounds like about the same thing. When a PostingPool flushes a sorted
run to temp storage, it actually uses the real file format. And when
it's time to merge an existing segment, a PostingPool gets assigned to
it and reads it back in. So when the external sorting apparatus goes
into fetch mode, the process of recovering RawPosting/FlatPosting
objects is essentially the same whether they're being recovered from an
existing segment or from temporary storage.

> We actually lose some concurrency here: if a large doc is being
> indexed by one thread, and a tiny doc by another thread, that tiny doc
> can't write until the large doc finishes, and this ties up the tiny
> doc's thread state such that if that thread tries to index another
> doc, it will have to wait until the previous doc gets written.

Writing stored fields shouldn't be CPU intensive, though. My guess is
that you're i/o bound at that point.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/