Return-Path: Delivered-To: apmail-lucene-java-dev-archive@www.apache.org Received: (qmail 55876 invoked from network); 27 Sep 2005 21:47:31 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (209.237.227.199) by minotaur.apache.org with SMTP; 27 Sep 2005 21:47:31 -0000 Received: (qmail 64039 invoked by uid 500); 27 Sep 2005 21:47:30 -0000 Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org Received: (qmail 63812 invoked by uid 500); 27 Sep 2005 21:47:29 -0000 Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-dev@lucene.apache.org Received: (qmail 63797 invoked by uid 99); 27 Sep 2005 21:47:28 -0000 Received: from asf.osuosl.org (HELO asf.osuosl.org) (140.211.166.49) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 27 Sep 2005 14:47:28 -0700 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests= X-Spam-Check-By: apache.org Received-SPF: pass (asf.osuosl.org: local policy) Received: from [12.154.210.214] (HELO rectangular.com) (12.154.210.214) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 27 Sep 2005 14:47:34 -0700 Received: from [67.189.26.9] (helo=[10.0.1.2]) by rectangular.com with esmtpa (Exim 4.44) id 1EKNKh-000IWf-BW for java-dev@lucene.apache.org; Tue, 27 Sep 2005 14:49:51 -0700 Mime-Version: 1.0 (Apple Message framework v734) Content-Transfer-Encoding: 7bit Message-Id: <529C25D7-BFEC-4FF4-B72C-082A48E992B9@rectangular.com> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed To: java-dev@lucene.apache.org From: Marvin Humphrey Subject: KinoSearch merge model Date: Tue, 27 Sep 2005 14:47:04 -0700 X-Mailer: Apple Mail (2.734) X-Virus-Checked: Checked by ClamAV on apache.org X-Spam-Rating: minotaur.apache.org 1.6.2 0/1000/N Greets, As mentioned in my previous post, the most significant architectural difference between the Lucene/Plucene indexer and KinoSearch indexer is the merge model. KinoSearch's merge model is considerably more efficient in Perl; I suspect that it may also be incrementally more efficient in Java, though it would take a fair amount of work to find out. When this new indexer based on KinoSearch indexes a document, it writes stored fields and norms just as Lucene does. However, when the document is inverted, instead of writing a mini-inverted-index, the postings are serialized and dumped into a sort pool. The serialization algo is designed so that the sorted postings emerge from the sort pool in the order ideal for writing an index after a simple lexical sort. The concatenated components are: 1) Field number* [unsigned big-endian 16-bit integer] 2) term 3) document number [unsigned big-endian 32-bit integer] 4) positions [array (C, not Perl) of 32-bit integers] 5) term length [unsigned big-endian 16-bit integer] * It is possible to use the field number because of KinoSearch's requirement that all fields be defined in advance; fields are sorted lexically prior to the assignment of field numbers, so sorting by name and number produce identical results. After the sort is executed, the strings are fed into a while loop, where the components are pulled apart (except for field number and term, which remain united for now). freq (the number of times the term appears in the document) is determined by counting the number of elements in the positions array. doc_freq is derived by incrementing a count over subsequent loops until the field-number-plus-term string changes. Since each iteration represents one document, the doc_freq is the number of loop iters that pass by. We now have all the elements needed for writing the tii, tis, frq, and prx files. Of course, there is a major obstacle which must be overcome with this approach: you can only dump the serialized postings for a small number of documents into the sort pool before you run out of RAM. The answer is to implement an external sorting algorithm. I wrote one, and contributed it to CPAN: The KinoSearch merge model is much better for Perl because there's no way to implement the Lucene merge model without zillions of objects and method calls. The OO overhead for comparing serialized postings is much lower, as the information encoded into the strings can be compared lexically, so objects need not be rebuilt and sorted by member variable. In Java, OO overhead is less of a factor, but I suspect there are still some gains to be had. There are other advantages: norms and fields are written only once per segment (and segments are written less often) for starters. The crucial code resides at present in the write_postings() method within the PostingsWriter module. http://www.rectangular.com/cgi-bin/viewcvs.cgi/ksearch/lib/KinoSearch/ Index/PostingsWriter.pm?rev=1.9&content-type=text/vnd.viewcvs-markup I don't know whether this merge model will ever be of use to Java Lucene, but it was suggested that I bring it up in this forum after I described it on the Plucene list. I imagine that reproducing Sort::External in Java would be straightforward, if its equivalent does not already exist. Food for thought, in any case. Cheers, Marvin Humphrey Rectangular Research http://www.rectangular.com/ --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org For additional commands, e-mail: java-dev-help@lucene.apache.org