lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject Re: improve how IndexWriter uses RAM to buffer added documents
Date Thu, 05 Apr 2007 04:14:41 GMT

On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote:

>> (: Ironically, the numbers for Lucene on that page are a little
>> better than they should be because of a sneaky bug.  I would have
>> made updating the results a priority if they'd gone the other  
>> way.  :)
> Hrm.  It would be nice to have hard comparison of the Lucene, KS (and
> Ferret and others?).

Doing honest, rigorous benchmarking is exacting and labor-intensive.   
Publishing results tends to ignite flame wars I don't have time for.

The main point that I wanted to make with that page was that KS was a  
lot faster than Plucene, and that it was in Lucene's ballpark.   
Having made that point, I've moved on.  The benchmarking code is  
still very useful for internal development and I use it frequently.

At some point I would like to port the benchmarking work that has  
been contributed to Lucene of late, but I'm waiting for that code  
base to settle down first.  After that happens, I'll probably make a  
pass and publish some results.  Better to spend the time preparing  
one definitive presentation than to have to rebut every idiot's  
latest wildly inaccurate shootout.

>> ... However, Lucene has been tuned by an army of developers over the
>> years, while KS is young yet and still had many opportunities for
>> optimization.  Current svn trunk for KS is about twice as fast for
>> indexing as when I did those benchmarking tests.
> Wow, that's an awesome speedup!

The big bottleneck for KS has been its Tokenizer class.  There's only  
one such class in KS, and it's regex-based.  A few weeks ago, I  
finally figured out how to hook it into Perl's regex engine at the C  
level.  The regex engine is not an official part of Perl's C API, so  
I wouldn't do this if I didn't have to, but the tokenizing loop is  
only about 100 lines of code and the speedup is dramatic.

I've also squeezed out another 30-40% by changing the implementation  
in ways which have gradually winnowed down the number of malloc()  
calls.  Some of the techniques may be applicable to Lucene; I'll get  
around to firing up JIRA issues describing them someday.

> So KS is faster than Lucene today?

I haven't tested recent versions of Lucene.  I believe that the  
current svn trunk for KS is faster for indexing than Lucene 1.9.1.   
But... A) I don't have an official release out with the current  
Tokenizer code, B) I have no immediate plans to prepare further  
published benchmarks, and C) it's not really important, because so  
long as the numbers are close you'd be nuts to choose one engine or  
the other based on that criteria rather than, say, what language your  
development team speaks.  KinoSearch scales to multiple machines, too.

Looking to the future, I wouldn't be surprised if Lucene edged ahead  
and stayed slightly ahead speed-wise, because I'm prepared to make  
some sacrifices for the sake of keeping KinoSearch's core API simple  
and the code base as small as possible.  I'd rather maintain a  
single, elegant, useful, flexible, plenty fast regex-based Tokenizer  
than the slew of Tokenizers Lucene offers, for instance.  It might be  
at a slight disadvantage going mano a mano against Lucene's  
WhiteSpaceTokenizer, but that's fine.

Marvin Humphrey
Rectangular Research

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message