Message-ID: <33177734.1165230443927.JavaMail.jira@brutus>
Date: Mon, 4 Dec 2006 03:07:23 -0800 (PST)
From: "Michael Busch (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Closed: (LUCENE-624) Segment size limit for compound files
In-Reply-To: <32782770.1152655350157.JavaMail.jira@brutus>

     [ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]

Michael Busch closed LUCENE-624.
--------------------------------

    Resolution: Won't Fix
      Assignee: Michael Busch

I'm closing this issue, because:
- no votes or comments for almost half a year
- only indexing performance benefits slightly from this feature
- another config parameter in IndexWriter will probably confuse users more than help them

> Segment size limit for compound files
> -------------------------------------
>
>                 Key: LUCENE-624
>                 URL: http://issues.apache.org/jira/browse/LUCENE-624
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>            Priority: Minor
>         Attachments: cfs_seg_size_limit.patch
>
>
> Hello everyone,
> I implemented an improvement targeting compound file usage. Compound files are used to decrease the number of index files, because operating systems can't handle too many open file descriptors. On the other hand, a disadvantage of the compound file format is its worse performance compared to multi-file indexes:
> http://www.gossamer-threads.com/lists/lucene/java-user/8950
> The book "Lucene in Action" says that the compound file format is about 5-10% slower than the multi-file format.
> The patch I'm proposing here adds the ability to the IndexWriter to use the compound format only for segments that do not contain more documents than a specific limit, "CompoundFileSegmentSizeLimit", which the user can set.
> Due to the exponential merges, a Lucene index usually contains only a few very big segments, but many more small segments. The best performance is really only needed for the big segments, whereas slightly worse performance for small segments shouldn't play a big role in the overall search performance.
> Consider the following example:
> Index Size: 1,500,000
> Merge factor: 10
> Max buffered docs: 100
> Number of indexed fields: 10
> Max. OS file descriptors: 1024
> In the worst case, a non-optimized index could contain the following number of segments:
> 1 x 1,000,000
> 9 x 100,000
> 9 x 10,000
> 9 x 1,000
> 9 x 100
> That's 37 segments. A multi-file format index would have:
> 37 segments * (7 files per segment + 10 files for indexed fields) = 629 files ==> only about 2 open indexes per machine could be handled by the operating system
> A compound-file format index would have:
> 37 segments * 1 cfs file = 37 files ==> about 27 open indexes could be handled by the operating system, but performance would be 5-10% worse.
> A compound-file format index with CompoundFileSegmentSizeLimit = 1,000,000 would have:
> 36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 files ==> about 20 open indexes could be handled by the OS
> The OS can now handle 20 instead of just 2 open indexes, while maintaining the multi-file format performance.
> I'm going to create diffs on the current HEAD and will attach the patch files soon. Please let me know what you think about this improvement.
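
The file-count arithmetic in the quoted example can be checked with a small sketch. This is not code from cfs_seg_size_limit.patch and it does not use any Lucene API; the class SegmentFileEstimate and its methods are hypothetical names made up for illustration. It also assumes that a segment is written in compound format only when its document count is strictly below the limit, since that reading is the one that reproduces the 53-file figure for a limit of 1,000,000.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Lucene) that redoes the arithmetic of the example. */
final class SegmentFileEstimate {
    // Figures taken from the example in the issue description.
    static final int FILES_PER_SEGMENT = 7;
    static final int INDEXED_FIELDS = 10; // one additional file per indexed field

    /** Number of files one segment occupies in multi-file format. */
    static int multiFileCount() {
        return FILES_PER_SEGMENT + INDEXED_FIELDS;
    }

    /**
     * Total number of index files if every segment with fewer than
     * cfsSizeLimit documents is stored as a single .cfs file.
     * Assumption: the comparison is strict (docs < limit).
     */
    static int totalFiles(List<Integer> segmentDocCounts, int cfsSizeLimit) {
        int files = 0;
        for (int docs : segmentDocCounts) {
            files += (docs < cfsSizeLimit) ? 1 : multiFileCount();
        }
        return files;
    }

    /** The worst-case segment list from the example: 1 x 1,000,000 and 9 each of 100,000 down to 100. */
    static List<Integer> worstCaseSegments() {
        List<Integer> segments = new ArrayList<>();
        segments.add(1_000_000);
        for (int size = 100_000; size >= 100; size /= 10) {
            for (int i = 0; i < 9; i++) {
                segments.add(size);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        List<Integer> segments = worstCaseSegments();
        int maxDescriptors = 1024;

        int multiFile = totalFiles(segments, 0);                 // no segment is below 0 docs: all multi-file
        int compound = totalFiles(segments, Integer.MAX_VALUE);  // every segment is below the limit: all compound
        int mixed = totalFiles(segments, 1_000_000);             // only the largest segment stays multi-file

        for (int files : new int[] {multiFile, compound, mixed}) {
            System.out.println(files + " files, open indexes per "
                    + maxDescriptors + " descriptors: "
                    + String.format("%.1f", maxDescriptors / (double) files));
        }
    }
}
```

With the segment list above, the three totals come out as 629, 37 and 53 files, matching the numbers in the issue description. The descriptors-per-index ratios are printed unrounded, because the description's "about 2", "about 27" and "about 20" are approximate.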