lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Benchmarking results
Date Mon, 10 Apr 2006 05:29:23 GMT

On Apr 7, 2006, at 10:05 AM, Doug Cutting wrote:

 > Another axis that I don't think you're yet measuring is how things  
change as
 > the index grows.

There are lots of axes that I haven't measured yet, but I do have to  
move on
to other things sooner or later.  :)  Running decent scientific  
experiments is
painstaking work.  I have thrown away many, many hours of invalid data.

Besides, my laptop deserves a break.  I've been running it flat out  
for so
long, it thinks it's a space heater.

 > What happens with 10k, 100k and 1M and 10M documents?

There are 19043 docs in the extracted Reuters corpus, comprising a  
little over
16 MB of content.  I've set up the benchmarking apps so that if you  
specify
more than 19043 docs on the command line, it loops back through the
collection, allowing an arbitrarily large number of docs to be indexed.

Here are results for 100,000 docs:

config                  truncated mean secs (6 reps)    max memory (1  
rep)
------------------------------------------------------------------------ 
--
Lucene / JVM 1.4                       356 secs              224 MB
KinoSearch / Perl 5.8.8                407 secs               30 MB

... and 1 million docs:

config                  truncated mean secs (6 reps)    max memory (1  
rep)
------------------------------------------------------------------------ 
--
Lucene / JVM 1.4                      4013                  skipped
KinoSearch / Perl 5.8.8               4776                  skipped

I'll try to get the 10 million benchmark done, but it's going to be a  
little
tough, as I'm running these on my main working laptop.  I can predict  
the
trend, though, as we pile more and more docs on KinoSearch 0.09_03.

At 19000 docs and a mem_threshold of 16 MB, the external sorting  
algorithm
flushes 6 times, so 6 sorted runs must be merged when the postings  
sort pool
gets sorted and the tis/tii/frq/prx files get written.  At 1 million  
docs
and a mem_threshold of 16 MB, the flush happens somewhere in the  
neighborhood
of 300 times, so c. 300 sorted runs must be merged.

At 10 million docs and a mem_threshold of 16 MB, we should see  
something like
3000 sorted runs.  That's probably too many, and if it isn't,  
eventually it
will be.

When the external sorter flips from feeding to fetching mode, it  
establishes a
buffer for each run.  In KinoSearch 0.09_03, each buffer is allowed  
to consume
memory according to this formula:

     sortex->run_cache_limit = (sortex->mem_threshold / 2) / sortex- 
 >num_runs;

When there's 6 runs, each buffer can eat 1.33 MB, but when there's  
300 runs,
each buffer only gets around 25 kB.

The primary problem with external sorting algorithms that use an N- 
way merge
is that they tend to be IO bound [1, 2]: between recovering data from  
the sorted
runs and writing to the final output, the disk has to seek all over  
the place.
Therefore, it's important that the each run's buffer be fairly  
generous, to
minimize the number of refills.

I'll wager that at 3000 runs and 2.5 kB buffers, the external sorter is
going to seize up.  Fortunately there's an easy solution:

Use more than 16 MB of ram.

Earlier versions of KinoSearch used a different external sorter which  
set a
fixed buffer size of 32 kB of content.  I'll fix 0.09 before the  
official
release to set a floor of 64 kB ram usage for the buffers, and sometime
hence consider how to expose mem_threshold in the public API.

 > There are typically knees in search engine performance curves when
 > indexes get substantially larger than RAM, and behaviour on either
 > side of the knee may differ with different indexing strategies.

As far as KinoSearch is concerned, the index was substantially larger
than RAM when it was only 19000 docs, so we're well past the knee.

Arguably, these tests have been comparing apples and crabapples,  
because I
haven't forced KinoSearch to use more RAM.  As mentioned earlier, the  
rough
analogue to maxMergeDocs is the RAM that the in-memory sort pool is  
allowed
to consume before a run must be written to disk -- 16 MB, by  
default.  I can
set it higher, but that's not part of the public API, so it seemed like
cheating to do so.  Here's how things shake out when KinoSearch is  
hacked to
use maximum RAM...

19043 docs, body neither stored nor vectorized:

config                  truncated mean secs (6 reps)    max memory (1  
rep)
------------------------------------------------------------------------ 
--
Lucene / JVM 1.4                      43.68 secs                79 MB
KinoSearch / Perl 5.8.8               63.89 secs               230 MB

19043 docs, body stored and vectorized:

config                  truncated mean secs (6 reps)    max memory (1  
rep)
------------------------------------------------------------------------ 
--
KinoSearch / Perl 5.8.8               69.12 secs               236 MB
Lucene / JVM 1.4                      71.96 secs               118 MB

Perhaps in my desire to give Lucene every advantage, I've gone  
overboard and
hobbled KinoSearch too much.  My main interest is in comparing  
algorithms, and
in retrospect it's a more realistic comparison if I up KinoSearch's RAM,
private API hack or no.  However, I knew that at least some of the  
people
reading these benchmarking reports would see "Cage Match! Lucene vs.
KinoSearch!" and not pay attention to the subtler, more interesting  
qualitative
differences: where one or the other does better and why.  It seemed  
important
to take the highest high road so that no controversy could ever arise  
as to
whether one engine had been given a leg up.  I'm just not worried  
about the
speed that KinoSearch is ultimately going to attain.  I'm intimately
acquainted with where the bottlenecks tend to occur in Perl, but  
KinoSearch is
not a pure Perl library -- it's an extension to Perl written in Perl  
and C,
with some of the performance-critical work done by C.  (Call it a "C  
library
with a Perl interface", if you like -- that's close enough.) This  
approach has
its drawbacks, but now that after much experimentation and effort all  
or nearly
all of the algorithmic issues have been solved, speed isn't one of  
them.  There
are numerous optimizations still to be done, so it's going to get  
faster.

The interesting thing to me about the unlimited-memory results above  
is how the
results change when there's stored and vectorized data.  A tricked- 
out Lucene
1.9.1 indexer seems to outperform a tricked-out KinoSearch 0.09_03  
indexer with
regards to inverting documents.  However, when there's stored/ 
vectorized data,
Lucene appears to have some overhead that KinoSearch doesn't.   
Corroborating
data from additional independent experiments would be nice to have,  
but I think
at this point we have enough to entertain the hypothesis that the  
KinoSearch
merge model, considered in isolation, is simply a more efficient  
algorithm than
the current Lucene merge model.  I suspect that were Lucene to adopt it,
indexing speed would improve.

   * The store field data and term vectors data are only written  
once, rather
     than shuffled around with each merge.
   * The serialized postings are plain old byte strings which can be  
compared
     using memcmp, so there's no OO overhead for the comparison routine.
   * The streams in an N-way merge are more predictable and therefore  
easier to
     optimize for IO efficiency.
   * External sorting is a well-studied problem and existing  
scholarship can be
     leveraged.

That's my two cents, and my contribution.

Thanks for the critique,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

[1] "Parallel Out-of-Core Sorting: The Third Way", Geeta Chaudhry
     Section 8.1, "Merging based algorithms"
     http://www.cs.dartmouth.edu/reports/abstracts/TR2004-517/
[2] "Asynchronous Parallel Disk Sorting", Dementiev/Sanders
     Section 2.2 "Multi-way merging"
     http://i10www.ira.uka.de/dementiev/files/DS03.pdf

RAW DATA - 100,000 docs
==================================================

slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M - 
XX:CompileThreshold=100 LuceneIndexer -docs 100000 -reps 6
---------------------------------------------------
1   Secs: 384.62  Docs: 100000
2   Secs: 359.83  Docs: 100000
3   Secs: 354.13  Docs: 100000
4   Secs: 354.63  Docs: 100000
5   Secs: 355.56  Docs: 100000
6   Secs: 354.18  Docs: 100000
---------------------------------------------------
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.6 ppc
Mean: 360.49 secs
Truncated mean (4 kept, 2 discarded): 356.05 secs
---------------------------------------------------
slothbear:~/Desktop/ks/t/benchmarks marvin$ cd ~/Desktop/ks588/t/ 
benchmarks/
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/ 
perl -Mblib indexers/kinosearch_indexer.plx --docs=100000 --reps=6
------------------------------------------------------------
1    Secs: 406.87  Docs: 100000
2    Secs: 409.73  Docs: 100000
3    Secs: 405.33  Docs: 100000
4    Secs: 407.35  Docs: 100000
5    Secs: 406.76  Docs: 100000
6    Secs: 409.45  Docs: 100000
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 407.58 secs
Truncated mean (4 kept, 2 discarded): 407.61 secs
------------------------------------------------------------

RAW DATA - 1,000,000 docs
==================================================

slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/ 
perl -Mblib indexers/kinosearch_indexer.plx --docs=1000000 --reps=6;  
cd ~/Desktop/ks/t/benchmarks/; java -server -Xmx500M - 
XX:CompileThreshold=100 LuceneIndexer -docs 1000000 -reps 6
------------------------------------------------------------
1    Secs: 4590.71  Docs: 1000000
2    Secs: 4694.36  Docs: 1000000
3    Secs: 4976.30  Docs: 1000000
4    Secs: 4760.99  Docs: 1000000
5    Secs: 4801.72  Docs: 1000000
6    Secs: 4834.97  Docs: 1000000
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 4776.51 secs
Truncated mean (4 kept, 2 discarded): 4773.01 secs
------------------------------------------------------------
---------------------------------------------------
1   Secs: 4,023.33  Docs: 1000000
2   Secs: 4,003.18  Docs: 1000000
3   Secs: 4,012.21  Docs: 1000000
4   Secs: 4,015.97  Docs: 1000000
5   Secs: 4,010.61  Docs: 1000000
6   Secs: 4,013.39  Docs: 1000000
---------------------------------------------------
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.6 ppc
Mean: 4,013.12 secs
Truncated mean (4 kept, 2 discarded): 4,013.04 secs
---------------------------------------------------
slothbear:~/Desktop/ks/t/benchmarks marvin$


RAW DATA -- KinoSearch large RAM
================================

slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/ 
perl -Mblib indexers/kinosearch_indexer.plx --reps=6
------------------------------------------------------------
1    Secs: 64.86  Docs: 19043
2    Secs: 68.95  Docs: 19043
3    Secs: 67.98  Docs: 19043
4    Secs: 69.92  Docs: 19043
5    Secs: 71.56  Docs: 19043
6    Secs: 69.63  Docs: 19043
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 68.82 secs
Truncated mean (4 kept, 2 discarded): 69.12 secs
------------------------------------------------------------
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/ 
perl -Mblib indexers/kinosearch_indexer.plx --reps=6
------------------------------------------------------------
1    Secs: 60.00  Docs: 19043
2    Secs: 63.35  Docs: 19043
3    Secs: 66.77  Docs: 19043
4    Secs: 63.82  Docs: 19043
5    Secs: 64.04  Docs: 19043
6    Secs: 64.37  Docs: 19043
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 63.72 secs
Truncated mean (4 kept, 2 discarded): 63.89 secs
------------------------------------------------------------
slothbear:~/Desktop/ks588/t/benchmarks marvin$





---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message