lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Justus Pendleton <jpendle...@atlassian.com>
Subject Performance of never optimizing
Date Mon, 03 Nov 2008 03:42:52 GMT
Howdy,

I have a couple of questions regarding some Lucene benchmarking and  
what the results mean[3]. (Skip to the numbered list at the end if you  
don't want to read the lengthy exegesis :)

I'm a developer for JIRA[1]. We are currently trying to get a better  
understanding of Lucene, and our use of it, to cope with the needs of  
our larger customers. These "large" indexes are only a couple hundred  
thousand documents but our problem is compounded by the fact that they  
have a relatively high rate of modification (=delete+insert of new  
document) and our users expect these modification to show up in query  
results pretty much instantly.

Our current default behaviour is a merge factor of 4. We perform an  
optimization on the index every 4000 additions. We also perform an  
optimize at midnight. Our fundamental problem is that these  
optimizations are locking the index for unacceptably long periods of  
time, something that we want to resolve for our next major release,  
hopefully without undermining search performance too badly.

In the Lucene javadoc there is a comment, and a link to a mailing list  
discussion[2], that suggests applications such as JIRA should never  
perform optimize but should instead set their merge factor very low.

In an attempt to understand the impact of a) lowering the merge factor  
from 4 to 2 and b) never, ever optimizing on an index (over the course  
of years and millions of additions/updates) I wanted to try to  
benchmark Lucene.

I used the contrib/benchmark framework and wrote a small algorithm  
that adds documents to an index (using the Reuters doc generator),  
does a search, does an optimize, then does another search. All the  
pretty pictures can be seen at:

   http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs

I have several questions, hopefully they aren't overwhelming in their  
quantity :-/

1. Why does the merge factor of 4 appear to be faster than the merge  
factor of 2?

2. Why does non-optimized searching appear to be faster than optimized  
searching once the index hits ~500,000 documents?

3. There appears to be a fairly sizable performance drop across the  
board around 450,000 documents. Why is that?

4. Searching performance appears to decrease towards a fairly  
pessimistic 20 searches per second (for a relatively simple search).  
Is this really what we should expect long-term from Lucene?

5. Does my benchmark even make sense? I am far from an expert on  
benchmarking so it is possible I'm not measuring what I think I am  
measuring.

Thanks in advance for any insight you can provide. This is an area  
that we very much want to understand better as Lucene is a key part of  
JIRA's success,

Cheers,
Justus
JIRA Developer

[1]: http://www.atlassian.com
[2]: http://www.gossamer-threads.com/lists/lucene/java-dev/47895
[3]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message