lucene-java-user mailing list archives

From Otis Gospodnetic <>
Subject Re: Performance of never optimizing
Date Mon, 03 Nov 2008 05:27:34 GMT

Very quick comments.

----- Original Message ----
> From: Justus Pendleton <>
> To:
> Sent: Sunday, November 2, 2008 10:42:52 PM
> Subject: Performance of never optimizing
> Howdy,
> I have a couple of questions regarding some Lucene benchmarking and what the 
> results mean[3]. (Skip to the numbered list at the end if you don't want to read 
> the lengthy exegesis :)
> I'm a developer for JIRA[1]. We are currently trying to get a better 
> understanding of Lucene, and our use of it, to cope with the needs of our larger 
> customers. These "large" indexes are only a couple hundred thousand documents 
> but our problem is compounded by the fact that they have a relatively high rate 
> of modification (=delete+insert of new document) and our users expect these 
> modifications to show up in query results pretty much instantly.

This will be a tough call with large indices - there is no real-time search in Lucene yet.

> Our current default behaviour is a merge factor of 4. We perform an optimization 
> on the index every 4000 additions. We also perform an optimize at midnight. Our 

I wouldn't optimize every 4000 additions. Each optimize rewrites the whole index, which kills
IO while you are trying to provide fast searches, and it locks the index for other modifications.
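To put a rough number on "rewriting the whole index", here is a back-of-the-envelope sketch. It assumes every optimize rewrites all documents indexed so far and ignores deletes; the class and method names are made up for illustration and are not part of Lucene.

```java
class OptimizeCostSketch {
    /**
     * Total documents rewritten while the index grows to totalDocs,
     * assuming one full rewrite (optimize) every `interval` additions.
     * Simplification: deletes and segment sizes are ignored.
     */
    static long docsRewritten(long totalDocs, long interval) {
        long rewritten = 0;
        for (long size = interval; size <= totalDocs; size += interval) {
            rewritten += size; // each optimize rewrites everything indexed so far
        }
        return rewritten;
    }

    public static void main(String[] args) {
        // An index growing to 400,000 documents, optimized every 4000 additions.
        System.out.println(docsRewritten(400_000, 4_000));
    }
}
```

Under those assumptions, growing to 400,000 documents means 100 optimizes and about 20 million document rewrites in total, so the cost grows quadratically with index size.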

> fundamental problem is that these optimizations are locking the index for 
> unacceptably long periods of time, something that we want to resolve for our 
> next major release, hopefully without undermining search performance too badly.

Why are you optimizing?  Trying to make the search faster?  I would try to avoid optimizing
during high usage periods.

> In the Lucene javadoc there is a comment, and a link to a mailing list 
> discussion[2], that suggests applications such as JIRA should never perform 
> optimize but should instead set their merge factor very low.

Right, you can let Lucene merge segments.

> In an attempt to understand the impact of a) lowering the merge factor from 4 to 
> 2 and b) never, ever optimizing on an index (over the course of years and 
> millions of additions/updates) I wanted to try to benchmark Lucene.

One thing that you might not have tried is the constant re-opening of the IndexReader, which
you'll need to do if you want to see index changes instantly.
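A toy sketch of why re-opening matters, assuming a reader is a point-in-time snapshot of the index. This is not Lucene code: ReopenSketch, add, reopen and search are hypothetical names, and the real cost of re-opening a reader is exactly what a benchmark would need to measure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

class ReopenSketch {
    // Hypothetical stand-in for the index on disk.
    private final List<String> docs = new ArrayList<>();
    // What searches see: an immutable snapshot taken at the last reopen.
    private final AtomicReference<List<String>> snapshot =
            new AtomicReference<>(Collections.<String>emptyList());

    synchronized void add(String doc) {
        docs.add(doc); // written to the "index", but not yet visible to searches
    }

    synchronized void reopen() {
        // Copy the current state; searches switch to it atomically.
        snapshot.set(Collections.unmodifiableList(new ArrayList<>(docs)));
    }

    boolean search(String term) {
        for (String d : snapshot.get()) {
            if (d.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ReopenSketch index = new ReopenSketch();
        index.add("JIRA issue updated");
        System.out.println(index.search("JIRA")); // snapshot predates the add
        index.reopen();
        System.out.println(index.search("JIRA")); // visible after the reopen
    }
}
```

If users expect changes to show up instantly, the reopen step runs after nearly every modification, so its cost belongs in the benchmark.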

> I used the contrib/benchmark framework and wrote a small algorithm that adds 
> documents to an index (using the Reuters doc generator), does a search, does an 
> optimize, then does another search. All the pretty pictures can be seen at:

So you indexed once and then measured search performance?  Or did you measure indexing performance?
I can't quite tell from your email.
And in one case you optimized before searching and in the other you did not optimize?

> I have several questions, hopefully they aren't overwhelming in their quantity 
> :-/
> 1. Why does the merge factor of 4 appear to be faster than the merge factor of 
> 2?

Faster for indexing or searching?  If indexing, then it's because 4 means fewer segment merges
than 2.  If searching, then I don't know, unless you had indexing and searching happening
in parallel, which then means less IO for 4.
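The "fewer segment merges" point can be sketched with a simplified model of a logarithmic merge scheme: every flush creates a one-document segment, and whenever mergeFactor segments of the same size exist they are merged into one. This is an assumption-laden toy, not Lucene's actual merge policy, and the names are made up.

```java
import java.util.ArrayList;
import java.util.List;

class MergeFactorSketch {
    /** Returns {number of merges, total documents rewritten by merges}. */
    static long[] simulate(int mergeFactor, int flushes) {
        // counts.get(level) = segments currently at that level;
        // a segment at level n holds mergeFactor^n documents.
        List<Integer> counts = new ArrayList<>();
        counts.add(0);
        long merges = 0;
        long docsRewritten = 0;
        for (int i = 0; i < flushes; i++) {
            int level = 0;
            long size = 1; // a fresh segment holds one document
            counts.set(0, counts.get(0) + 1);
            // Cascade: a full level is merged into one segment on the next level.
            while (counts.get(level) == mergeFactor) {
                counts.set(level, 0);
                size *= mergeFactor;
                merges++;
                docsRewritten += size;
                level++;
                if (counts.size() == level) {
                    counts.add(0);
                }
                counts.set(level, counts.get(level) + 1);
            }
        }
        return new long[] {merges, docsRewritten};
    }

    public static void main(String[] args) {
        for (int mf : new int[] {2, 4}) {
            long[] r = simulate(mf, 16);
            System.out.println("mergeFactor=" + mf + " merges=" + r[0]
                    + " docsRewritten=" + r[1]);
        }
    }
}
```

For 16 flushes this model gives 15 merges and 64 document rewrites with a merge factor of 2, against 5 merges and 32 rewrites with a merge factor of 4. The flip side is that the higher factor leaves more segments for each search to visit between merges.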

Did your index fit in RAM, by the way?

> 2. Why does non-optimized searching appear to be faster than optimized searching 
> once the index hits ~500,000 documents?

Not sure without seeing the index/machine.
It sounds like you were measuring search performance while at the same time increasing the
index size by incrementally adding more docs?

> 3. There appears to be a fairly sizable performance drop across the board around 
> 450,000 documents. Why is that?

Something to do with Lucene merging index segments around that point?  At this point I'm assuming
you were measuring search speed while indexing.

> 4. Searching performance appears to decrease towards a fairly pessimistic 20 
> searches per second (for a relatively simple search). Is this really what we 
> should expect long-term from Lucene?

20 reqs/sec sounds very low.  How large is your index, how much RAM, and how about heap size?
What were your queries like?  Random?  Taken from a log?

> 5. Does my benchmark even make sense? I am far from an expert on benchmarking so 
> it is possible I'm not measuring what I think I am measuring.

I'm confused by what exactly you did and measured, but it could just be that I'm tired.

> Thanks in advance for any insight you can provide. This is an area that we very 
> much want to understand better as Lucene is a key part of JIRA's success,

> [1]:
> [2]:
> [3]:

Sematext -- Lucene - Solr - Nutch
