lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: Benchmarkers
Date Mon, 03 Apr 2006 23:58:00 GMT

On Apr 3, 2006, at 6:57 AM, Yonik Seeley wrote:

> A couple of points:
>  - Are all the lucene variations using the same index parameters?
>    max buffered docs, index format (compound or not), mergeFactor, etc
>    I personally use non-compound index format, max buffered docs=1000,
>    mergeFactor=10

      IndexWriter writer = new IndexWriter(indexDir,
        new WhitespaceAnalyzer(), true);
+    writer.setMaxBufferedDocs(1000);
+    writer.setUseCompoundFile(false);

I'll set Lucene to use the non-compound format.  KinoSearch only  
supports the compound index format, but since it only writes one  
segment per indexing session, each file only gets rewritten once and  
that's not going to be much of a handicap.  Plucene only uses the
non-compound format.

KinoSearch doesn't have max_buffered_docs or merge_factor settings,  
since it uses a different merge model based on external sorting and  
serialized postings.  Currently, it keeps track of the amount of  
memory consumed by the in-memory sort pool, and writes a run when  
that number hits 20 MB.  Version 0.09_02 uses its own external  
sorting routine for the first time, so I can and probably should  
adapt it to use a max_buffered_docs variable, which it will need to poll
a lot less frequently.  But that's an optimization for another day.
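
To make the merge model above concrete, here is a rough sketch of a sort pool that flushes a sorted run whenever its estimated memory use crosses a threshold, then merges the runs at the end. This is not KinoSearch's code: the class name SortPoolSketch, the 2-bytes-per-char memory estimate, and the use of in-memory lists in place of on-disk runs are all assumptions made for illustration, and it is written in modern Java rather than the 1.4/1.5 discussed in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of threshold-triggered runs plus a k-way merge.
public class SortPoolSketch {
    private final long maxPoolBytes;   // flush threshold (20 MB in the message above)
    private final List<String> pool = new ArrayList<>();
    // Stand-in for runs on disk; a real external sort would write files here.
    private final List<List<String>> runs = new ArrayList<>();
    private long poolBytes = 0;

    public SortPoolSketch(long maxPoolBytes) {
        this.maxPoolBytes = maxPoolBytes;
    }

    // Add one serialized posting and flush a run if the pool is "full".
    public void add(String posting) {
        pool.add(posting);
        poolBytes += 2L * posting.length();   // assumed estimate: 2 bytes per char
        if (poolBytes >= maxPoolBytes) {
            flushRun();
        }
    }

    // Sort the current pool and store it as one run, then reset the pool.
    private void flushRun() {
        if (pool.isEmpty()) {
            return;
        }
        List<String> run = new ArrayList<>(pool);
        Collections.sort(run);
        runs.add(run);
        pool.clear();
        poolBytes = 0;
    }

    public int runCount() {
        return runs.size();
    }

    // Flush whatever is left, then merge all sorted runs into one sorted list.
    public List<String> finish() {
        flushRun();
        // Each heap entry is {runIndex, offsetWithinRun}, ordered by the
        // posting that entry currently points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            heap.add(new int[] {i, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            List<String> run = runs.get(top[0]);
            merged.add(run.get(top[1]));
            if (top[1] + 1 < run.size()) {
                heap.add(new int[] {top[0], top[1] + 1});
            }
        }
        return merged;
    }
}
```

Checking a byte count on every add is the frequent polling mentioned above; a max_buffered_docs style counter would replace the size estimate with a simple document count.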

Plucene is a Lucene 1.3 port, so it doesn't have max_buffered_docs --  
but I can set merge_factor to 1000.

>  - reading in the file line by line probably isn't the fastest (esp
> when you just construct another big string out of it).

I'm addressing this issue in my reply to Doug.

>  - Java settings:
>    - use the 1.5 JVM if possible, it's much faster than 1.4 in my  
> experience

Interestingly, 1.5 produces slightly inferior results on my G4.  (I  
know about the command line alias snafu, BTW: <http://>).

I'll include results from both 1.4 and 1.5.  I'll also include  
results for a vanilla compile of Perl 5.8.8, which is definitely  
faster than the Perl 5.8.6 Apple ships with OS X Tiger.

>    - use "-server", it's much faster than "-client"
>    - use enough heap so too much time isn't taken in GC
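
For anyone reproducing the runs, a small check like the following can confirm which VM and how much heap a benchmark actually got. It is only a sketch: the class name JvmSettingsCheck and the 512 MB figure in the comment are made up for illustration, and the VM name string varies by vendor and version.

```java
// Hypothetical helper, not part of the benchmark suite.
// Example invocation (heap size is an arbitrary illustration):
//   java -server -Xmx512m JvmSettingsCheck
public class JvmSettingsCheck {
    // Maximum heap the JVM will attempt to use, in megabytes.
    public static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // Name of the running VM; on HotSpot this typically mentions Client or Server.
    public static String vmName() {
        return System.getProperty("java.vm.name");
    }

    public static void main(String[] args) {
        System.out.println("VM:       " + vmName());
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```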


Marvin Humphrey
Rectangular Research
