lucene-dev mailing list archives

From "Michael McCandless" <>
Subject Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge
Date Fri, 23 Mar 2007 19:52:20 GMT

"Grant Ingersoll" <> wrote:

> Your timing is ironic.  I was just running some benchmarks for  
> ApacheCon (using contrib/benchmarker) and noticed what I think are  
> similar happenings, so maybe you can validate my assumptions.  I'm  
> not sure if it is because I'm hitting RAM issues or not.
> Below is the algorithm file for use w/ benchmarker.  To run it, save  
> the file, cd into contrib/benchmarker (make sure you get the latest  
> commits) and run
> ant run-task -Dtask.mem=XXXXm -Dtask.alg=<path to file>
> The basic idea is, there are ~21580 docs in the Reuters, so I wanted  
> to run some experiments around them with different merge factors and  
> max.buffered.  Granted, some of the factors are ridiculous, but I  
> wanted to look at these a bit b/c you see people on the user list  
> from time to time talking/asking about setting really high numbers  
> for mergeFactor and maxBufferedDocs.
> The sweet spot on my machine seems to be mergeFactor == 100,  
> maxBD=1000.  I ran with -Dtask.mem=1024M on a machine with 2gb of  
> RAM.  If I am understanding the numbers correctly, and what you are  
> arguing, this sweet spot happens to coincide approximately with the  
> amount of memory I gave the process.  I probably could play a little  
> bit more with options to reach the inflection point.  So, to some  
> extent, I think your approach for RAM based modeling is worth pursuing.

Interesting results!  An even higher maxBufferedDocs (10000 =
299.1 rec/s and 21580 = 271.9 rec/s, @ mergeFactor=100) gave you worse
performance even though those runs were able to complete (meaning you
had enough RAM to buffer all those docs).  Perhaps this is because GC
had to work harder?  So it seems like at some point the benefits of
giving more RAM taper off.

Also, one caveat: whenever #docs (21578 for Reuters) divided by
maxBufferedDocs is less than mergeFactor, no merges will take place
during your runs.  This greatly skews the results, because those runs
never pay any merge cost.
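
To make that caveat concrete, here is a minimal sketch of the rule of
thumb.  The class and method names are hypothetical (they are not part
of Lucene or contrib/benchmarker), and it only models the simple
"#docs / maxBufferedDocs vs. mergeFactor" check described above, not
the real merge policy:

```java
// Hypothetical helper: applies the rule of thumb from this message.
// It does NOT call into Lucene; it only does the arithmetic.
public class MergeRuleOfThumb {

    // Number of flushes needed to index numDocs when a flush happens
    // every maxBufferedDocs documents (the last flush may be partial).
    static int flushCount(int numDocs, int maxBufferedDocs) {
        return (numDocs + maxBufferedDocs - 1) / maxBufferedDocs;
    }

    // Rule of thumb: if numDocs / maxBufferedDocs < mergeFactor,
    // no merge is expected during the run.
    static boolean anyMergeExpected(int numDocs, int maxBufferedDocs, int mergeFactor) {
        return ((double) numDocs / maxBufferedDocs) >= mergeFactor;
    }

    public static void main(String[] args) {
        int numDocs = 21578;  // Reuters doc count quoted above
        int[][] settings = {  // {maxBufferedDocs, mergeFactor}
            {10, 100}, {1000, 10}, {1000, 100}, {10000, 100}, {21580, 100}
        };
        for (int[] s : settings) {
            System.out.println("maxBufferedDocs=" + s[0]
                + " mergeFactor=" + s[1]
                + " flushes=" + flushCount(numDocs, s[0])
                + " anyMergeExpected=" + anyMergeExpected(numDocs, s[0], s[1]));
        }
    }
}
```

By this check, any setting where anyMergeExpected is false measures
pure buffering and flushing, so its rec/s is not directly comparable
to a setting that did have to service merges.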

I'm also struggling with this on LUCENE-843 because it makes it far
harder to do an apples-to-apples comparison.  With the patch for
LUCENE-843, many more docs can be buffered into a given fixed MB RAM.
So then it flushes less often and may hit no merges (when the baseline
Lucene trunk does hit merges), or the opposite: it may hit one massive
merge close to the end when the baseline Lucene trunk did a few
small merges.  We sort of need some metric that can normalize for how
much "merge servicing" took place during a run, or something.

