lucene-dev mailing list archives

From Mark Miller <>
Subject Re: ConcurrentMergeScheduler and MergePolicy question
Date Mon, 03 Aug 2009 19:59:36 GMT
Michael McCandless wrote:
> On the impact of search performance for large vs small mergeFactors, I
> think the jury is still out.  People should keep testing that (and
> report back!).  Certainly, for the fastest reopen time you never want
> any merging to be done :)
Here is the original exchange I referenced:

 >>On Fri, Apr 10, 2009 at 3:06 PM, Mark Miller wrote:
 >>    24 segments is bound to be quite a bit slower than an optimized
 >>    index for most things

 >I'd be curious just how true this really is (in general)... my guess
 >is the "long tail of tiny segments" gets into the OS's IO cache (as
 >long as the system stays hot) and doesn't actually hurt things much.
 >Has anyone tested this (performance of unoptimized vs optimized
 >indexes, in general) recently?  To be a fair comparison, there should
 >be no deletions in the index.

After reading that, I played with some sorting code I had and did a
quick cheesy test or two - one segment vs 10 or 20. In that horrible
test (based on the stress sort code), I don't remember seeing much of a
difference. There was no sorting involved. Very, very unscientific,
quick and dirty.

This time I loaded up 1.3 million Wikipedia articles, gave the test
768MB of RAM, warmed the Searcher with lots of searching before each
measurement, and compared 1 segment vs 5. The optimized index was 15-20%
faster with the queries I was using (approx 100 queries targeted at
Wikipedia). It's an odd test system - Ubuntu, quad-core laptop with slow
laptop drives and 4 GB of RAM. Still not very scientific, but better
than before.
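
For anyone who wants to repeat this outside the benchmark package, the
warm-then-measure approach could look roughly like the sketch below. It
is not the code I ran: WarmedTiming, meanNanos and percentFaster are
made-up names, the workload is whatever Runnable you pass in (e.g. one
pass over your query set against a given searcher), and "X% faster" is
assumed here to mean the reduction in elapsed time relative to the
slower index.

```java
// Hypothetical helper, not part of Lucene or its benchmark package.
final class WarmedTiming {

    /**
     * Runs the workload warmupRuns times without timing it, then returns
     * the mean wall-clock nanoseconds per run over measuredRuns runs.
     */
    static double meanNanos(Runnable workload, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            workload.run(); // warm caches and the JIT; result is discarded
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            workload.run();
        }
        return (System.nanoTime() - start) / (double) measuredRuns;
    }

    /**
     * Assumed definition of "percent faster": the reduction in elapsed
     * time relative to the slower run, 100 * (slow - fast) / slow.
     */
    static double percentFaster(double slowNanos, double fastNanos) {
        return 100.0 * (slowNanos - fastNanos) / slowNanos;
    }
}
```

You would call meanNanos once with the multi-segment searcher and once
with the optimized one, using the same queries and the same warmup
count, and feed the two means to percentFaster.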

Here is the benchmark I was using in various forms:

{ "Rounds"


    { "Populate"
        { "MAddDocs" AddDoc > : 15000
    { "test"
        { "WarmRdrDocs" Warm > : 50
        { "WarmRdr" Search > : 5000
        { "SearchSameRdr" Search > : 50000
    } : 2

RepSumByPrefRound SearchSameRdr

I also did a quick profile for a 15k index, 1 segment vs 10 segments. I
profiled each for approx 11 million calls of readVInt. The hotspot
results are below.
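
(For context, and not part of the profile output: readVInt decodes
Lucene's variable-length ints, 7 bits per byte, low-order bits first,
with the high bit of each byte marking that another byte follows. Below
is a rough standalone sketch of that decoding over a plain byte array.
VIntCursor is a made-up class for illustration, not the Lucene source.)

```java
// Hypothetical stand-in for an index input; illustrates the VInt decoding only.
final class VIntCursor {
    private final byte[] bytes;
    private int pos;

    VIntCursor(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Number of bytes consumed so far. */
    int position() {
        return pos;
    }

    /** Decodes one variable-length int starting at the current position. */
    int readVInt() {
        byte b = bytes[pos++];
        int value = b & 0x7F;                     // low 7 bits of the first byte
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[pos++];                     // high bit was set: read another byte
            value |= (b & 0x7F) << shift;         // next 7 bits go above the previous ones
        }
        return value;
    }
}
```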

Just a quick start at looking into this from over the weekend.

- Mark
