From Marvin Humphrey
Subject Re: Benchmarking results
Date Fri, 07 Apr 2006 00:25:53 GMT

On Apr 4, 2006, at 10:23 AM, Tatu Saloranta wrote:
> So in this case, what would give more comparable results (assuming
> you are interested in measuring likely server-side
> usage scenario, which is usually what Lucene is used for)

My main interest with these tests is algorithmic performance.  How  
much time it takes to start up or warm up a JVM isn't something I  
want to be measuring.  There are startup issues I'm concerned about,  
but they mostly relate to file format design.  The load time for  
field norms is a significant concern.  So is the IndexInterval, which  
is set to 1024 by default in KinoSearch instead of 128 as in Lucene.  So is the
locality of reference issue for where the term vector data gets  
stored.  All of those things affect the total time it takes for a  
KinoSearch app to launch, load, search, and return results, which  
needs to be as small as possible so that e.g. website search apps  
indexing up to [some large number of] documents can be run as simple  
CGI scripts.  I'm considering further modifications to the file  
format to keep that total time down...

Actually, I think the benchmark results illustrate that everyone  
should be at least mildly concerned about where the Term Vector data  
gets stored.  KinoSearch only writes that data once.  Lucene,  
however, has to read/write that data during each merge, and the more  
streams you have, the more complex the merge.  It stands to reason  
that storing term vector data with the stored fields data would speed  
up the merge process.

I brought this issue up a few weeks ago, but in a search-time  
context.  The two primary applications for Term Vector data that I am  
aware of are excerpting/highlighting and "more like this" searches,  
both of which would benefit from having the term vectors stored with  
the documents, because each search would require fewer disk seeks.    
Term Vectors might also be used to build a pure vector space search  
engine, like the one described in this article < 
pub/a/2003/02/19/engine.html>, but that's impractical for indexes  
larger than a handful of documents and of academic interest only.   
Are there any other significant applications?  If not, I submit that  
term vectors belong in the .fdx file.

> would be to run all runs within same JVM / execution (for Perl),

Thanks for the critique.  I've updated the indexer apps to accept two  
command line arguments.  They're now run like so:

     java [ARGS] LuceneIndexer -reps 6 -docs 1000
     perl indexers/kinosearch_indexer.plx --reps=6 --docs=1000
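
For readers who want the shape of a "reps" loop, here is a minimal sketch of timing several repetitions inside one process, so that startup and warmup are paid once rather than per run.  It is not the code from either indexer app: the class name, the timeReps helper, and the stand-in workload are all hypothetical, and it assumes a modern JVM (lambdas, System.nanoTime) rather than the 1.4.2 JVM used for these tests.

```java
import java.util.Locale;

// Hypothetical sketch: not the actual LuceneIndexer benchmark code.
final class RepsTimerSketch {

    /** Runs the task `reps` times in this JVM and returns the seconds taken by each rep. */
    static double[] timeReps(int reps, Runnable task) {
        double[] secs = new double[reps];
        for (int i = 0; i < reps; i++) {
            long start = System.nanoTime();
            task.run();
            secs[i] = (System.nanoTime() - start) / 1e9;
        }
        return secs;
    }

    public static void main(String[] args) {
        // Stand-in workload (an assumption); a real benchmark would build an index here.
        Runnable work = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 200_000; i++) {
                sb.append(i % 10);
            }
            if (sb.length() != 200_000) {
                throw new IllegalStateException("unexpected workload result");
            }
        };

        double[] secs = timeReps(6, work);
        for (int i = 0; i < secs.length; i++) {
            System.out.println(String.format(Locale.ROOT, "%d   Secs: %.2f", i + 1, secs[i]));
        }
    }
}
```

Because every rep shares one process, later reps run on an already warmed-up runtime, which is the point of the critique quoted above.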

With the new methodology, the numbers are slightly better for  
Lucene.  They're actually worse for KinoSearch.  I've isolated the
code that's responsible for the slowdown, and I speculate that
it's a memory fragmentation issue, as I can solve it by forcing
KinoSearch to consume more memory at that point.  However, having  
established that KinoSearch is in Lucene's league with regard to
indexing speed, I'm not worried about absolute numbers, and the new  
benchmarker interface is slightly more stable, allowing more accurate  
comparative analysis of algorithmic efficiency.  The trends are still  
apparent: KinoSearch gains ground when there's stored and vectorized  

Raw data is below.

> and either take the fastest runs, or discard the first one and take  
> median or
> average.

As you'll see in the raw data, the apps now produce two aggregate  
numbers: a mean, and a truncated mean < 
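
As a sketch of how the two aggregates can be computed: the exact trimming rule the apps use is not stated in this message, so the code below assumes the single fastest and single slowest rep are dropped.  That is consistent with the "4 kept, 2 discarded" lines in the raw data and yields the mean and truncated mean reported for the first Lucene run.  The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical sketch: the trimming rule (drop one value from each end) is an assumption.
final class TruncatedMeanSketch {

    /** Plain arithmetic mean. */
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Drops the `trim` smallest and `trim` largest values, then averages the rest. */
    static double truncatedMean(double[] xs, int trim) {
        if (trim < 0 || 2 * trim >= xs.length) {
            throw new IllegalArgumentException("trim leaves no values to average");
        }
        double[] sorted = xs.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        return mean(Arrays.copyOfRange(sorted, trim, sorted.length - trim));
    }

    public static void main(String[] args) {
        // Timings copied from the first Lucene run in the raw data.
        double[] lucene = {87.02, 84.56, 85.04, 83.83, 84.75, 84.84};
        int trim = 1;
        System.out.println(String.format(Locale.ROOT, "Mean: %.2f secs", mean(lucene)));
        System.out.println(String.format(Locale.ROOT,
                "Truncated mean (%d kept, %d discarded): %.2f secs",
                lucene.length - 2 * trim, 2 * trim, truncatedMean(lucene, trim)));
    }
}
```

Trimming both ends keeps a single cold first rep or one unusually slow rep from dominating the aggregate.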

> ps. Regarding memory usage: it is also quite tricky to measure
>  reliably, since Garbage Collection only kicks in when it has to...
>  so Java uses as much memory as it can (without expanding heap)...
>  plus, JVMs do not necessarily (or even usually) return unused
>  chunks later on.

Yes.  Still, there is a correlation  between maxBufferedDocs and max  
memory consumption by the process.  So Java must be reusing something...

     maxBufferedDocs   max memory (1 rep)   truncated mean time (6 reps)
         10                69 MB                124.89 secs
        100                91 MB                 88.17 secs
       1000               169 MB                 84.80 secs

Marvin Humphrey
Rectangular Research

RAW DATA - JVM warmup / truncated mean experiment

slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer -reps 6
1   Secs: 87.02  Docs: 19043
2   Secs: 84.56  Docs: 19043
3   Secs: 85.04  Docs: 19043
4   Secs: 83.83  Docs: 19043
5   Secs: 84.75  Docs: 19043
6   Secs: 84.84  Docs: 19043
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.5 ppc
Mean: 85.01 secs
Truncated mean (4 kept, 2 discarded): 84.80 secs
slothbear:~/Desktop/ks/t/benchmarks marvin$ cd ~/Desktop/ks588/t/ 
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx --reps 6
1    Secs: 75.51  Docs: 19043
2    Secs: 80.79  Docs: 19043
3    Secs: 81.12  Docs: 19043
4    Secs: 84.68  Docs: 19043
5    Secs: 81.78  Docs: 19043
6    Secs: 79.65  Docs: 19043
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.5.0 Power Macintosh
Mean: 80.59 secs
Truncated mean (4 kept, 2 discarded): 80.83 secs
slothbear:~/Desktop/ks588/t/benchmarks marvin$

RAW DATA - mergefactor experiment

slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer -reps 6
1   Secs: 127.05  Docs: 19043
2   Secs: 125.50  Docs: 19043
3   Secs: 125.44  Docs: 19043
4   Secs: 124.53  Docs: 19043
5   Secs: 124.10  Docs: 19043
6   Secs: 121.57  Docs: 19043
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.5 ppc
Mean: 124.70 secs
Truncated mean (4 kept, 2 discarded): 124.89 secs
slothbear:~/Desktop/ks/t/benchmarks marvin$ vim indexers/
slothbear:~/Desktop/ks/t/benchmarks marvin$ javac -d . indexers/
slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer -reps 6
1   Secs: 89.91  Docs: 19043
2   Secs: 87.59  Docs: 19043
3   Secs: 88.51  Docs: 19043
4   Secs: 88.59  Docs: 19043
5   Secs: 87.97  Docs: 19043
6   Secs: 86.75  Docs: 19043
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.5 ppc
Mean: 88.22 secs
Truncated mean (4 kept, 2 discarded): 88.17 secs
slothbear:~/Desktop/ks/t/benchmarks marvin$

