From: Yonik Seeley
To: java-dev@lucene.apache.org
Date: Mon, 16 May 2005 12:01:57 -0400
Subject: Re: [Performance]: IndexWriter again...

I like the idea, Paul.

As for how it should be implemented, perhaps a count of docs in memory
should be kept. It doesn't seem necessary to traverse all of the segments
on every add (it's a linear operation, and will only result in a merge
every "minMergeDocs" or "maxBufferedDocs" adds).

-Yonik

On 5/16/05, Paul Smith wrote:
> In summary, I still firmly believe that IndexWriter.maybeMergeSegments()
> is chewing a lot more CPU than would be ideal. So I ran a simple test. I
> ran the same test I've done before, using mergeFactor(1000),
> maxBufferedDocs(10000), useCompoundFile(false), indexing 5 fields (user
> first/lastname/email address).
>
> As a baseline using the latest SVN source code, I'm getting an indexing rate
> of between 490-515 items/second over a number of runs.
>
> By applying the attached simple patch to IndexWriter, I'm getting between
> 945-970 items/second over a number of test runs. That's a significant
> speed up. All the patch is doing is deferring the call to
> maybeMergeSegments so it only does it every 2000 iterations (2000 is
> totally arbitrary on my part).
>
> I've verified with Luke that the index generated contains the same number
> of documents and the same number of terms, but I have not had a chance to
> properly set up my local environment to run the test cases.
>
> Obviously the attached patch is a dirty hack of the highest order. In my
> case I'm re-indexing from scratch every time, so there may be a reason why
> we shouldn't be doing this sort of deferring of method calls. Perhaps the
> source code is optimized around incremental/batch updates to _existing_
> indexes, with the penalty that creating a new index performs slower than
> one would like.
>
> Perhaps IndexWriter could benefit from another setting that lets one
> configure how often to call maybeMergeSegments()? That could of course
> confuse more people than it helps.
>
> I would really appreciate anyone's thoughts on this. I'll be very happy to
> be proven wrong because it will just help me understand more of Lucene. I
> would hope that speeding up indexing would benefit everyone? Particularly
> the large scale sites out there.
>
> cheers,
>
> Paul Smith
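
The suggestion in this thread, keeping a count of docs in memory rather than traversing the segments on every add, could be sketched roughly as below. This is only an illustration under stated assumptions: MergeCheckSketch and all of its members are hypothetical names, not Lucene's IndexWriter or its API, and the model assumes each added document starts out as its own one-doc in-memory segment. The scanning mode stands in for a per-add traversal; the counting mode makes the same flush decision from a single integer.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch, NOT Lucene's IndexWriter. It only models the
 * "should we flush/merge yet?" check that runs after every add.
 */
class MergeCheckSketch {
    private final int maxBufferedDocs;   // flush threshold (assumed semantics)
    private final boolean useCounter;    // true: O(1) counter, false: scan segments
    private final List<Integer> ramSegments = new ArrayList<>(); // doc count per buffered segment
    private int bufferedDocs = 0;        // running count of docs held in memory
    private int flushes = 0;             // how many times the buffer was flushed
    private long segmentsVisited = 0;    // work done by the scanning check

    MergeCheckSketch(int maxBufferedDocs, boolean useCounter) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.useCounter = useCounter;
    }

    /** Buffers one document and runs the flush check. */
    void addDocument() {
        ramSegments.add(1);  // assumption: each doc becomes a one-doc segment
        bufferedDocs++;
        if (shouldFlush()) {
            flush();
        }
    }

    private boolean shouldFlush() {
        if (useCounter) {
            // Constant-time check against the running count.
            return bufferedDocs >= maxBufferedDocs;
        }
        // Linear check: re-sum every buffered segment on every add.
        int total = 0;
        for (int docs : ramSegments) {
            total += docs;
            segmentsVisited++;
        }
        return total >= maxBufferedDocs;
    }

    private void flush() {
        // A real writer would merge the buffered segments to disk here.
        ramSegments.clear();
        bufferedDocs = 0;
        flushes++;
    }

    int flushes() { return flushes; }

    long segmentsVisited() { return segmentsVisited; }
}
```

Both modes flush at exactly the same points, so the resulting index would be the same; only the per-add cost of the check differs. With a large maxBufferedDocs (such as the 10000 used in the test above) the scanning mode's cost per add grows with the number of buffered docs, which is consistent with the slowdown described in the quoted message, though this sketch does not measure the real code.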