Message-ID: <33329361.1180123876602.JavaMail.jira@brutus>
Date: Fri, 25 May 2007 13:11:16 -0700 (PDT)
From: "robert engels (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes
In-Reply-To: <28979694.1179926596715.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499214 ]

robert engels commented on LUCENE-888:
--------------------------------------

I think the
important consideration is how expensive the system call is. Since the system call requires JNI, it MIGHT be expensive.

Another important consideration is buffer utilization. It is my understanding that the OS normally performs read-ahead only for sequential access, apart from the additional bytes read to optimize the physical read. If Lucene performs indexed reads but the data is actually being accessed sequentially, Lucene managing its own buffers can be far more effective.

Along these lines, if the server is heavily used for a variety of applications, Lucene can manage its own buffers more efficiently - similar to how a database almost always (every commercial one I know) has its own buffer cache and does not rely on the OS.

In a GC environment, though, it may be even more important that the buffers be managed/reused from a pool, as you avoid the GC overhead.

Just my thoughts.

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-888.patch
>
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput). Second is CompoundFileWriter's buffer used to
> actually build the compound file. Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843. I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English). I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields. I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.
> The resulting index is 1.7 GB. The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now. During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there. And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
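
To make the buffer-size argument concrete, here is a minimal sketch of why a larger output buffer means fewer calls into the underlying layer, which is where the system-call cost mentioned above would be paid. It is not Lucene code: it uses java.io.BufferedOutputStream as a stand-in for BufferedIndexOutput, and the names BufferSizeSketch, CountingSink and countUnderlyingWrites are made up for illustration. It counts calls only; it does not measure time and says nothing about the 622 sec vs 554 sec figures.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class BufferSizeSketch {

    /** Hypothetical sink: discards data but counts the write calls that reach it. */
    static final class CountingSink extends OutputStream {
        long calls;
        long bytes;

        @Override
        public void write(int b) {
            calls++;
            bytes++;
        }

        @Override
        public void write(byte[] b, int off, int len) {
            calls++;
            bytes += len;
        }
    }

    /**
     * Writes totalBytes one byte at a time through a buffer of bufferSize bytes
     * and returns how many write calls reached the underlying sink.
     */
    static long countUnderlyingWrites(int bufferSize, int totalBytes) {
        CountingSink sink = new CountingSink();
        try (OutputStream out = new BufferedOutputStream(sink, bufferSize)) {
            for (int i = 0; i < totalBytes; i++) {
                out.write(i & 0xFF); // small writes, as an indexer emitting single bytes would do
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.calls;
    }

    public static void main(String[] args) {
        int total = 1 << 20; // 1 MiB of output
        // Buffer sizes taken from the discussion: 1 KB (current), 8 KB, 16 KB, 32 KB.
        for (int size : new int[] {1024, 8 * 1024, 16 * 1024, 32 * 1024}) {
            System.out.println(size + " byte buffer: "
                    + countUnderlyingWrites(size, total) + " underlying writes");
        }
    }
}
```

Each underlying write in the sketch stands for one trip to the OS in a real file-backed stream, so the count falls roughly in proportion to the buffer size. Whether that translates into wall-clock gains, and where the knee in the curve sits, still has to be measured on a real index.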