lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject KinoSearch merge model
Date Tue, 27 Sep 2005 21:47:04 GMT
Greets,

As mentioned in my previous post, the most significant architectural  
difference between the Lucene/Plucene indexer and KinoSearch indexer  
is the merge model.  KinoSearch's merge model is considerably more  
efficient in Perl; I suspect that it may also be incrementally more  
efficient in Java, though it would take a fair amount of work to find  
out.

When this new indexer based on KinoSearch indexes a document, it  
writes stored fields and norms just as Lucene does.  However, when  
the document is inverted, instead of writing a mini-inverted-index,  
the postings are serialized and dumped into a sort pool.

The serialization algorithm is designed so that, after a simple
lexical sort, the postings emerge from the sort pool in the order
ideal for writing an index.  The concatenated components are (a
packing sketch follows the list):

     1) Field number* [unsigned big-endian 16-bit integer]
     2) term
     3) document number [unsigned big-endian 32-bit integer]
     4) positions [array (C, not Perl) of 32-bit integers]
     5) term length [unsigned big-endian 16-bit integer]

     * It is possible to use the field number because of
       KinoSearch's requirement that all fields be defined
       in advance; fields are sorted lexically prior to the
       assignment of field numbers, so sorting by name and
       number produce identical results.
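
To make the layout concrete, here is a minimal sketch of how one
posting might be packed so that a plain lexical sort produces the
desired order.  It is illustrative only, not the actual KinoSearch
code; writing the positions big-endian is a simplification.

    # Sketch of the serialization layout described above.
    sub serialize_posting {
        my ( $field_num, $term, $doc_num, @positions ) = @_;
        return pack( 'n', $field_num )      # 1) field number, 16-bit big-endian
             . $term                        # 2) term text
             . pack( 'N', $doc_num )        # 3) doc number, 32-bit big-endian
             . pack( 'N*', @positions )     # 4) positions array
             . pack( 'n', length $term );   # 5) term length, so the components
                                            #    can be pulled apart again later
    }

Because the numeric components are big-endian and the term sits right
after the field number, a plain string sort (cmp) on these byte
strings orders the postings by field, then term, then document number.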

After the sort is executed, the strings are fed into a while loop,  
where the components are pulled apart (except for field number and  
term, which remain united for now).

freq (the number of times the term appears in the document) is  
determined by counting the number of elements in the positions array.

doc_freq is derived by incrementing a count over successive loop
iterations until the field-number-plus-term string changes.  Since
each iteration represents one document, doc_freq is the number of
iterations that pass before the change.

We now have all the elements needed for writing the tii, tis, frq,  
and prx files.
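
Here is a minimal sketch of that consuming loop, assuming the packing
layout shown above.  The names @sorted_postings, flush_term(), and
write_freq_and_positions() are placeholders; the real logic lives in
write_postings().

    my $prev_key = '';
    my $doc_freq = 0;
    for my $posting (@sorted_postings) {    # however they arrive from the sort
        # The term length lives in the final two bytes.
        my $term_len  = unpack( 'n', substr( $posting, -2 ) );
        my $key       = substr( $posting, 0, 2 + $term_len );  # field num + term
        my $doc_num   = unpack( 'N', substr( $posting, 2 + $term_len, 4 ) );
        my @positions = unpack( 'N*', substr( $posting, 2 + $term_len + 4, -2 ) );
        my $freq      = scalar @positions;  # times the term appears in this doc

        if ( $key ne $prev_key ) {
            flush_term( $prev_key, $doc_freq ) if $prev_key ne '';
            $prev_key = $key;
            $doc_freq = 0;
        }
        $doc_freq++;                        # one iteration == one document
        write_freq_and_positions( $doc_num, $freq, @positions );
    }
    flush_term( $prev_key, $doc_freq ) if $prev_key ne '';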

Of course, there is a major obstacle that must be overcome with this
approach: you can only dump the serialized postings for a small
number of documents into the sort pool before you run out of RAM.
The answer is to implement an external sorting algorithm.  I wrote
one and contributed it to CPAN:
<http://search.cpan.org/search?query=sort+external>
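
For readers unfamiliar with the technique, here is a rough sketch of
the principle: spill sorted runs to temp files, then merge them.  It
does not reflect Sort::External's actual interface; the
newline-delimited records and the emit() call are just stand-ins.

    use File::Temp qw( tempfile );

    my ( @buffer, @run_files );
    my $max_buffered = 100_000;    # arbitrary in-memory threshold

    sub feed_item {
        push @buffer, $_[0];
        spill_run() if @buffer >= $max_buffered;
    }

    # Sort the in-memory buffer and write it out as one run.
    sub spill_run {
        return unless @buffer;
        my ( $fh, $filename ) = tempfile();
        print {$fh} "$_\n" for sort @buffer;    # real code would length-prefix,
        close $fh;                              # since postings are binary
        push @run_files, $filename;
        @buffer = ();
    }

    # Merge phase: repeatedly emit the smallest item among the run heads.
    sub merge_runs {
        spill_run();
        my @fhs   = map { open( my $fh, '<', $_ ) or die $!; $fh } @run_files;
        my @heads = map { scalar readline($_) } @fhs;
        while ( grep { defined } @heads ) {
            my ($i) = sort { $heads[$a] cmp $heads[$b] }
                      grep { defined $heads[$_] } 0 .. $#heads;
            chomp( my $item = $heads[$i] );
            $heads[$i] = readline( $fhs[$i] );
            emit($item);    # hand the next posting to the consumer
        }
    }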

The KinoSearch merge model is much better for Perl because there's no  
way to implement the Lucene merge model without zillions of objects  
and method calls.  The OO overhead for comparing serialized postings  
is much lower, as the information encoded into the strings can be  
compared lexically, so objects need not be rebuilt and sorted by  
member variable.

In Java, OO overhead is less of a factor, but I suspect there are
still some gains to be had.  There are other advantages as well: for
starters, norms and fields are written only once per segment (and
segments are written less often).

The crucial code resides at present in the write_postings() method  
within the PostingsWriter module.

http://www.rectangular.com/cgi-bin/viewcvs.cgi/ksearch/lib/KinoSearch/Index/PostingsWriter.pm?rev=1.9&content-type=text/vnd.viewcvs-markup

I don't know whether this merge model will ever be of use to Java  
Lucene, but it was suggested that I bring it up in this forum after I  
described it on the Plucene list.  I imagine that reproducing  
Sort::External in Java would be straightforward, if its equivalent  
does not already exist.

Food for thought, in any case.

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

