Message-ID: <68662.1175791412501.JavaMail.jira@brutus>
Date: Thu, 5 Apr 2007 09:43:32 -0700 (PDT)
From: "Michael McCandless (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-856) Optimize segment merging
In-Reply-To: <29717292.1175722292226.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487049 ]

Michael McCandless commented on LUCENE-856:
-------------------------------------------

OK I re-ran the above test (10 MM docs @
~5,500 bytes plain text each) with autoCommit=false: this time it took
5 hrs 7 minutes, which is 40.7% less time than the autoCommit=true test
above. Both of these tests were run with the patch from LUCENE-843.

So this means, if all you need to do is build a massive index with
term vector positions & offsets, the fastest way to do so is with the
patch from LUCENE-843 and with autoCommit=false on your writer.

Basically LUCENE-843 makes autoCommit=false quite a bit faster for a
very large index, assuming you are storing term vectors / stored
fields.

Still, I think optimizing segment merging is important because for
many uses of Lucene, the "interactivity" (how quickly a searcher sees
the recently indexed documents) is very important. For such cases you
should open a writer with autoCommit=false and then periodically close
& re-open it to publish the indexed documents to the searchers. With
that model, segment merging will still be a factor slowing down
indexing (though how much of a factor depends on how often you
close/open your writers).

> Optimize segment merging
> ------------------------
>
>                 Key: LUCENE-856
>                 URL: https://issues.apache.org/jira/browse/LUCENE-856
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> With LUCENE-843, the time spent indexing documents has been
> substantially reduced and now the time spent merging is a sizable
> portion of indexing time.
> I ran a test using the patch for LUCENE-843, building an index of 10
> million docs, each with ~5,500 byte plain text, with term vectors
> (positions + offsets) on and with 2 small stored fields per document.
> RAM buffer size was 32 MB. I didn't optimize the index in the end,
> though optimize speed would also improve if we optimize segment
> merging. Index size is 86 GB.
> Total time to build the index was 8 hrs 38 minutes, 5 hrs 40 minutes
> of which was spent merging.
> That's 65.6% of the time!
> Most of this time is presumably IO which probably can't be reduced
> much unless we improve overall merge policy and experiment with values
> for mergeFactor / buffer size.
> These tests were run on a Mac Pro with 2 dual-core Intel CPUs. The IO
> system is RAID 0 of 4 drives, so, these times are probably better than
> the more common case of a single hard drive which would likely be
> slower IO.
> I think there are some simple things we could do to speed up merging:
>   * Experiment with buffer sizes -- maybe larger buffers for the
>     IndexInputs used during merging could help? Because at a default
>     mergeFactor of 10, the disk heads must do a lot of seeking back and
>     forth between these 10 files (and then to the 11th file where we
>     are writing).
>   * Use byte copying when possible, eg if there are no deletions on a
>     segment we can almost (I think?) just copy things like prox
>     postings, stored fields, term vectors, instead of full parsing to
>     Java objects and then re-serializing them.
>   * Experiment with mergeFactor / different merge policies. For
>     example I think LUCENE-854 would reduce time spent merging for a
>     given index size.
> This is currently just a place to list ideas for optimizing segment
> merges. I don't plan on working on this until after LUCENE-843.
> Note that for "autoCommit=false", this optimization is somewhat less
> important, depending on how often you actually close/open a new
> IndexWriter. In the extreme case, if you open a writer, add 100 MM
> docs, close the writer, then no segment merges happen at all.
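
To make the mergeFactor discussion concrete, here is a rough sketch of how merge work grows with the number of flushed segments. It is a simplified model, not Lucene's actual merge policy: it assumes every flush produces a segment of the same size and that mergeFactor segments at one level are always merged into a single segment at the next level. The class name MergeCostSketch and the method docsCopiedByMerges are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of a level-based merge policy (an assumption, not Lucene's code). */
public final class MergeCostSketch {

    /**
     * Counts how many documents get rewritten by merges when `flushes`
     * equal-sized segments of `docsPerFlush` docs are written one after another.
     */
    public static long docsCopiedByMerges(int flushes, long docsPerFlush, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        // segmentsAtLevel.get(i) = number of segments currently sitting at level i
        List<Integer> segmentsAtLevel = new ArrayList<>();
        long copied = 0;
        for (int f = 0; f < flushes; f++) {
            int level = 0;
            long segmentDocs = docsPerFlush; // size of the segment being added
            while (true) {
                if (level == segmentsAtLevel.size()) {
                    segmentsAtLevel.add(0);
                }
                int count = segmentsAtLevel.get(level) + 1;
                if (count < mergeFactor) {
                    // Not enough segments at this level yet: keep the new one here.
                    segmentsAtLevel.set(level, count);
                    break;
                }
                // mergeFactor segments at this level: merge them into one bigger
                // segment, which rewrites every document they contain.
                segmentsAtLevel.set(level, 0);
                segmentDocs *= mergeFactor;
                copied += segmentDocs;
                level++; // the merged segment cascades to the next level
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        // 1000 one-doc flushes at mergeFactor 10: each doc is rewritten 3 times.
        System.out.println(docsCopiedByMerges(1000, 1, 10)); // prints 3000
        // Fewer than mergeFactor flushes: no merge is ever triggered.
        System.out.println(docsCopiedByMerges(9, 1, 10));    // prints 0
    }
}
```

In this model each document is rewritten roughly log_mergeFactor(flushes) times, which is one way to see why merge time can dominate a large build and why a run that never triggers a merge avoids that cost entirely.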