From: Justus Pendleton
To: java-user@lucene.apache.org
Subject: Performance of never optimizing
Date: Mon, 3 Nov 2008 14:42:52 +1100

Howdy,

I have a couple of questions regarding some Lucene benchmarking and what the results mean[3].
(Skip to the numbered list at the end if you don't want to read the lengthy exegesis :)

I'm a developer for JIRA[1]. We are currently trying to get a better understanding of Lucene, and our use of it, to cope with the needs of our larger customers. These "large" indexes are only a couple hundred thousand documents, but our problem is compounded by the fact that they have a relatively high rate of modification (=delete+insert of new document) and our users expect these modifications to show up in query results pretty much instantly.

Our current default behaviour is a merge factor of 4. We perform an optimization on the index every 4000 additions. We also perform an optimize at midnight. Our fundamental problem is that these optimizations lock the index for unacceptably long periods of time, something that we want to resolve for our next major release, hopefully without undermining search performance too badly.

In the Lucene javadoc there is a comment, and a link to a mailing list discussion[2], that suggests applications such as JIRA should never perform optimize but should instead set their merge factor very low.

In an attempt to understand the impact of a) lowering the merge factor from 4 to 2 and b) never, ever optimizing an index (over the course of years and millions of additions/updates) I wanted to try to benchmark Lucene. I used the contrib/benchmark framework and wrote a small algorithm that adds documents to an index (using the Reuters doc generator), does a search, does an optimize, then does another search. All the pretty pictures can be seen at:

http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs

I have several questions; hopefully they aren't overwhelming in their quantity :-/

1. Why does the merge factor of 4 appear to be faster than the merge factor of 2?

2. Why does non-optimized searching appear to be faster than optimized searching once the index hits ~500,000 documents?

3. There appears to be a fairly sizable performance drop across the board around 450,000 documents. Why is that?

4. Searching performance appears to decrease towards a fairly pessimistic 20 searches per second (for a relatively simple search). Is this really what we should expect long-term from Lucene?

5. Does my benchmark even make sense? I am far from an expert on benchmarking so it is possible I'm not measuring what I think I am measuring.

Thanks in advance for any insight you can provide. This is an area that we very much want to understand better, as Lucene is a key part of JIRA's success.

Cheers,
Justus
JIRA Developer

[1]: http://www.atlassian.com
[2]: http://www.gossamer-threads.com/lists/lucene/java-dev/47895
[3]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
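
For anyone wanting to reason about the merge factor 2 vs 4 trade-off without running the full benchmark, here is a rough sketch of a toy model. It does not use Lucene at all and is only an assumption about how logarithmic merging behaves in the simplest case: every flush creates one unit-sized segment, and as soon as mergeFactor segments of the same size exist they are merged into one larger segment. The class name MergeFactorSketch, the simulate method and the "units rewritten" measure are all made up for illustration; real Lucene merging also depends on buffering, deletes and segment sizes in bytes.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeFactorSketch {

    /** Outcome of the toy simulation (hypothetical type, not a Lucene class). */
    static final class Result {
        final int segments;        // segments left in the index at the end
        final long unitsRewritten; // total segment units copied by merges
        Result(int segments, long unitsRewritten) {
            this.segments = segments;
            this.unitsRewritten = unitsRewritten;
        }
    }

    /**
     * Simulates 'flushes' unit-sized segment flushes. Whenever a level holds
     * 'mergeFactor' segments they are merged into one segment at the next level.
     */
    static Result simulate(int flushes, int mergeFactor) {
        List<Integer> counts = new ArrayList<>(); // counts.get(level) = segments at that level
        long rewritten = 0;
        for (int i = 0; i < flushes; i++) {
            int level = 0;
            long size = 1; // size, in flush units, of a segment at the current level
            increment(counts, level);
            while (counts.get(level) == mergeFactor) {
                counts.set(level, 0);  // the merged segments disappear...
                size *= mergeFactor;   // ...and become one segment this large
                rewritten += size;     // a merge rewrites every unit it contains
                level++;
                increment(counts, level);
            }
        }
        int segments = 0;
        for (int c : counts) {
            segments += c;
        }
        return new Result(segments, rewritten);
    }

    private static void increment(List<Integer> counts, int level) {
        while (counts.size() <= level) {
            counts.add(0);
        }
        counts.set(level, counts.get(level) + 1);
    }

    public static void main(String[] args) {
        // 4000 matches the "optimize every 4000 additions" interval mentioned above.
        for (int mergeFactor : new int[] {2, 4}) {
            Result r = simulate(4000, mergeFactor);
            System.out.println("mergeFactor=" + mergeFactor
                    + " segments=" + r.segments
                    + " unitsRewritten=" + r.unitsRewritten);
        }
    }
}
```

In this simplified model the lower merge factor leaves fewer segments to search but rewrites more data while indexing, which is one plausible (not confirmed) angle on question 1.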