lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Eliminate postings hash (was Re: improve how IndexWriter uses RAM...)
Date Thu, 05 Apr 2007 17:04:56 GMT

On Apr 5, 2007, at 8:54 AM, Michael McCandless wrote:

> So you basically do not "de-dup" by field+term on your first pass
> through the tokens in the doc (which is "roughly" what that hash
> does).  Instead, append all tokens in an array, then sort first by
> field+text and second by position?  This is done for each document
> right?

Almost.  The sorting is done per-field.

Token doesn't have a "field", so comparison is cheaper than you're  
thinking.

int
Token_compare(const void *va, const void *vb)
{
     Token *const a = *((Token**)va);
     Token *const b = *((Token**)vb);

     size_t min_len = a->len < b->len ? a->len : b->len;

     int comparison = memcmp(a->text, b->text, min_len);

     if (comparison == 0) {
         if (a->len != b->len) {
             comparison = a->len < b->len ? -1 : 1;
         }
         else {
             comparison = a->pos < b->pos ? -1 : 1;
         }
     }

     return comparison;
}

> Did you every compare this approach against hash (or other de-dup data
> structure, letter trie or something) approach?

KS used to use hashing, though it wasn't directly analogous to how  
Lucene does things.  I've only tried these two techniques.  This was  
faster by about 30%, but the difference is not all in the de-duping.

KS is concerned with preparing serialized postings to feed into an  
external sorter.  In the hashing stratagem, every position added to a  
term_text => serialized_posting pair in the hash requires a string  
concatenation onto the end of serialized_posting, and thus a call to  
realloc().

Besides switching out hashing overhead for qsort overhead, the  
sorting technique also allows KS to know up front how many positions  
are associated with each posting, so the memory for the serialized  
string only has to be allocated once.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message