From: Marvin Humphrey
To: java-dev@lucene.apache.org
Subject: Re: Pooling of posting objects in DocumentsWriter
Date: Tue, 15 Apr 2008 14:57:16 -0700

On Apr 13, 2008, at 2:28 AM, Michael McCandless wrote:

>> In fact, RawPosting "objects" aren't even malloc'd individually --
>> they're assigned memory chunks by a custom MemoryPool object (which
>> *does* have to maintain awareness of alignment issues). You can't do
>> this:
>
> Right, it's that alignment waste that I'd think you need to "amortize"
> onto the cost of each RawPosting.

Worse than the alignment waste is that it's impossible to know just how
much memory each RawPosting requires until the export operation
completes, but you have to allocate it up front. That means planning
for a worst-case scenario. It's analogous to packing UTF-8 from UTF-16
source.

The way I hacked it was to always ask for the worst-case amount when
allocating each RawPosting, but allow quick resizing via an immediate
call to MemPool_Resize() -- see the sketch just below. I'm perfectly
comfortable with such hacks when they're contained to a small scope,
but this one's not: each Posting subclass has to deal with the
MemPool_Resize API or waste memory -- theoretically including public
subclasses. :(
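(For illustration, the worst-case-then-resize pattern looks roughly
like this. MemPool_Grab and RawPost_encode are stand-in names and the
sizing math is hypothetical -- a sketch of the idea, not the actual KS
code:)

    /* Worst case: one UTF-16 code unit never expands to more than 3
     * bytes of UTF-8, so ask for 3 bytes per code unit up front. */
    size_t      worst_case = sizeof(RawPosting) + num_code_units * 3;
    RawPosting *raw_post   = MemPool_Grab(mem_pool, worst_case);

    /* Fill in the posting, discovering the true size as we go. */
    size_t actual_size = RawPost_encode(raw_post, source_text,
                                        num_code_units);

    /* Hand the unused tail back to the pool right away.  This is cheap
     * because raw_post was the pool's most recent allocation. */
    MemPool_Resize(mem_pool, raw_post, actual_size);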
But perhaps we can do better...

>> Flexible arrays, which are a lesser-known but reliable hack, have to
>> be the last element in a C struct:
>>
>> http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
>
> Alas we can't use such neat tricks in Javaland...

Heh. It works, but it would be better if it weren't there. And I think
there is a solution which will be elegant both in Java and in KS.

Let's consider a new private class, FlatPosting, intended to take the
place of RawPosting:

    struct kino_FlatPosting {
        kino_VirtualTable *_;
        kino_ref_t         ref;
        chy_u32_t          doc_num;
        char              *content;
        size_t             content_len;
        char              *blob;
        size_t             blob_len;
    };

FlatPosting's "content" and "blob" members will be pointers into shared
buffers. In Java, they would be array offsets instead. Unlike
RawPosting objects, each FlatPosting is the same size -- no flexible
array members.

FlatPostings will be pooled within PostingPool parent objects, which
will own the shared buffers. Because they are pooled and reused,
allocation overhead is no longer crucial, and FlatPostings can be
ordinary objects. When the shared buffer gets full, the PostingPool
will flush a sorted run to disk.

Only PostingPool, PostingPoolQueue (the external sorter), and
PostingsWriter will have access to the FlatPostings -- it's best to
limit FlatPosting's visibility because managing access to the shared
buffer is tricky.

PostingPool will be public. Subclasses of Posting will contribute to
the PostingPool via a PostPool_Add() method:

    void
    PostPool_Add(PostingPool *self, u32_t doc_num,
                 char *content, size_t content_len,
                 char *blob, size_t blob_len);

Hmm, we'll need some way to track field as well. There are a couple of
ways of doing that...

>> I wish RawPosting's design was a little less cryptic. However,
>> malloc() and free(), used by the normal object allocation
>> infrastructure in KS, are comparatively expensive. Using a custom
>> allocation strategy speeds things up, at the cost of some code
>> clarity -- essentially the same tradeoff that started this thread.
>
> Makes sense.

Well, here's the thing. RawPosting works in KS and it's fast, but it's
a hack -- good enough for an implementation detail, but not good enough
to expose as a public API. And like Lucene's DocumentsWriter,
KinoSearch's PostingsWriter is large and difficult to penetrate. What
I'm hoping is that this discussion will yield some innovations that
allow me to refactor this code for clarity and interface simplicity.
There's still going to be a lot of code, but hopefully it can be broken
up into more coherent modules. I particularly want Posting to be as
small as possible, because that's the one I'd like to be publicly
subclassable.

FWIW, refactoring the external sorter has been pretty successful:

    KS::Util::SortExternal
    KS::Util::SortExRun
    KS::Test::Util::BBSortEx      extends SortExternal
    KS::Test::Util::BBSortExRun   extends SortExRun
    KS::Index::PostingPoolQueue   extends SortExternal
    KS::Index::PostingPool        extends SortExRun

SortExternal and SortExRun are abstract base classes. BBSortEx and
BBSortExRun, which deal with ByteBufs, are test-only subclasses that
exercise the merging code. PostingPoolQueue and PostingPool deal with
RawPosting objects.

PostingPoolQueue is nice and small now, and SortExternal and SortExRun
are pretty coherent and well tested. The only one left in that
hierarchy that's kind of bloated and fragile is PostingPool.

> I see. So Posting exposes the doc_num so the generic code
> (SegPList_next) can check if the doc is deleted when iterating.

Yes. It would be nice to make it an accessor, but the PList_Next()
method is inner loop code.
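(For illustration, the loop looks something like the sketch below. The
member and function names -- post_stream, deldocs, Post_Read_Record,
BitVec_Get -- are assumptions for the example's sake, not necessarily
the real KS internals:)

    /* Return the next Posting for a live (non-deleted) document, or
     * NULL when the segment's postings are exhausted. */
    kino_Posting*
    SegPList_next(kino_SegPostingList *self) {
        while (Post_Read_Record(self->posting, self->post_stream)) {
            /* Direct struct access instead of an accessor call keeps
             * the deletion check cheap in this inner loop. */
            if (!BitVec_Get(self->deldocs, self->posting->doc_num)) {
                return self->posting;
            }
            /* Doc is deleted; skip it and keep reading. */
        }
        return NULL;
    }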
> What is self->doc_base? In Lucene, each segment doesn't "know" its
> doc base; MultiSegmentReader handles that "above" this layer.

Right, I had to move that down. The reason is that Posting objects get
released into the wild, and once that happens they can't ask their
parent container object about the offset any more.

With a TermDocs object, you can only access the doc number through the
TermDocs itself...

    while (termDocs.next()) {
        int docNum = termDocs.doc();
        ...
    }

... but with a PostingList, you can access the doc number through the
Posting:

    while (postingList.next()) {
        Posting posting = postingList.getPosting();
        int docNum = posting.getDocNum();
        ...
    }

> So, first a Document is translated into an array of Tokens (one per
> term occurrence). Then this array gets sorted by term text, then term
> position. Then (2nd pass) you coalesce all adjacent Tokens that share
> the same term text and write each into a RawPosting. Then this array
> of RawPostings is passed to the external sorter.

Yes. You know what I'd really like to do, though? Kill off Token and
replace it with Posting.

> In Lucene we take each token and do a hash lookup and then write the
> necessary bytes immediately. We then only sort, once, during flush.

I'm a little confused. Are you saying you have a trick where Lucene
sorts less often? What you refer to as a "second pass" in KS doesn't
involve sorting (but it looks like you grok that?). Also, KS doesn't
use hashing to group Tokens. We discussed this algo once before. Your
reaction:

    OOOOOH! I like that approach! So you basically do not "de-dup" by
    field+term on your first pass through the tokens in the doc (which
    is "roughly" what that hash does). Instead, append all tokens in an
    array, then sort first by field+text and second by position? This
    is done for each document right? This seems like it could be a
    major win! :)
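(For concreteness, a minimal sketch of that per-document inversion
sort. The Token layout and the comparator are illustrative assumptions,
not the actual KS code:)

    #include <stdlib.h>
    #include <string.h>

    typedef struct Token {
        int    field;      /* field number */
        char  *text;       /* term text */
        size_t text_len;
        int    position;   /* position within the field */
    } Token;

    /* qsort comparator: order by field+text first, position second, so
     * that all occurrences of a term land adjacent to one another and
     * the coalescing pass can write one posting per unique term. */
    static int
    compare_tokens(const void *va, const void *vb) {
        const Token *a = *(const Token *const *)va;
        const Token *b = *(const Token *const *)vb;
        if (a->field != b->field) { return a->field - b->field; }
        size_t min_len = a->text_len < b->text_len
                       ? a->text_len : b->text_len;
        int comp = memcmp(a->text, b->text, min_len);
        if (comp != 0) { return comp; }
        if (a->text_len != b->text_len) {
            return a->text_len < b->text_len ? -1 : 1;
        }
        return a->position - b->position;
    }

    /* Usage: given Token *tokens[num_tokens] accumulated for one doc...
     * qsort(tokens, num_tokens, sizeof(Token*), compare_tokens); */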
> (Except, term vectors still must sort every document's Tokens.)

In KS, TermVectorsWriter gets passed the already-inverted array of
Tokens. This object, currently named TokenBatch but perhaps to be
renamed Inversion, is what Analyzers operate on in KS.

> Instead, segments are now allowed to share these "doc stores". Each
> segment can reference another segment's doc stores, starting at its
> own offset.

Mmm, that's clever. :) I wish it didn't break the segmentation model,
but it's a sensible optimization -- especially because you write
segments so much more often.

>> KS doesn't presently work with threads -- Perl's threading model is
>> kind of awkward.
>
> Oooh, ouch: the world is quickly going multi-core/cpu. I think thread
> concurrency is very important. But this sounds like a Perl threading
> limitation, imposed on you.

Yeah. I'd say it's a higher priority to finish the C API and get POSIX
threads working than it is to get Perl threads working.

> OK this sounds like it'd have good concurrency. Each thread would
> accumulate many docs worth of RawPostings arrays. Though, holding the
> global lock while doing the merge sort into the PostingPoolQueue's
> cache might become a bottleneck? How often does that happen and what
> net percentage of the time does it normally take?

It happens at the end of an indexing session, after the last doc has
been processed, when $invindexer->finish gets called -- the analogue to
IndexWriter.close(). (Probably I should put it into
$invindexer->prepare_commit. ;) The amount of time it takes to finish a
segment varies a lot. I haven't benchmarked it rigorously, but my guess
is that it averages between 5 and 25 percent of total time.

> In Lucene, each thread writes into its own private RawPosting
> equivalent (stored in a thread state class), until it's time to flush
> a segment, at which point we do a merge sort of all thread states
> while holding the global lock.

Sounds like about the same thing. When a PostingPool flushes a sorted
run to temp storage, it actually uses the real file format. And when
it's time to merge an existing segment, a PostingPool gets assigned to
it and reads it back in. So when the external sorting apparatus goes
into fetch mode, the process of recovering RawPosting/FlatPosting
objects is essentially the same whether they're being recovered from an
existing segment or from temporary storage.

> We actually lose some concurrency here: if a large doc is being
> indexed by one thread, and a tiny doc by another thread, that tiny doc
> can't write until the large doc finishes, and this ties up the tiny
> doc's thread state such that if that thread tries to index another
> doc, it will have to wait until the previous doc gets written.

Writing stored fields shouldn't be CPU intensive, though. My guess is
that you're i/o bound at that point.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/