Date: Mon, 16 May 2005 12:01:57 -0400
From: Yonik Seeley <yseeley@gmail.com>
To: java-dev@lucene.apache.org
Subject: Re: [Performance]: IndexWriter again...

I like the idea, Paul.

As for how it should be implemented, perhaps a count of docs in
memory should be kept.  It doesn't seem necessary to traverse all of
the segments on every add (the traversal is linear in the number of
segments, and will only result in a merge once every "minMergeDocs" or
"maxBufferedDocs" adds).
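
A rough sketch of the counting idea follows. It is not the actual IndexWriter code: the class, fields, and methods (CountingWriter, bufferedDocs, flush) are hypothetical stand-ins, and each buffered document is modeled as a one-document in-memory segment. The point is only that a running counter replaces the per-add walk over the segment list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a writer that buffers one-document segments in memory. */
class CountingWriter {
    private final int maxBufferedDocs;
    private final List<Integer> ramSegmentDocCounts = new ArrayList<>();
    private int bufferedDocs = 0; // running count, updated on every add
    private int flushes = 0;

    CountingWriter(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    void addDocument() {
        ramSegmentDocCounts.add(1); // one new in-memory segment holding one doc
        bufferedDocs++;
        // O(1) check: no need to sum doc counts over all segments here.
        if (bufferedDocs >= maxBufferedDocs) {
            flush();
        }
    }

    /** Placeholder for merging the buffered segments and writing them out. */
    private void flush() {
        ramSegmentDocCounts.clear();
        bufferedDocs = 0;
        flushes++;
    }

    int flushes() {
        return flushes;
    }

    int bufferedDocs() {
        return bufferedDocs;
    }
}
```

With maxBufferedDocs set to 10, adding 25 documents triggers the merge path twice and leaves 5 documents buffered.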

-Yonik

On 5/16/05, Paul Smith <psmith@aconex.com> wrote:
> In summary, I still firmly believe that IndexWriter.maybeMergeSegments()
> is chewing a lot more CPU than would be ideal.  So I ran a simple test.  I
> ran the same test I've done before, using mergeFactor(1000),
> maxBufferedDocs(10000), useCompoundFile(false), indexing 5 fields (user
> first/lastname/email address).
>
> As a baseline using the latest SVN source code, I'm getting an indexing rate
> of 490-515 items/second over a number of runs.
>
> By applying the attached simple patch to IndexWriter, I'm getting
> 945-970 items/second over a number of test runs.  That's a significant
> speedup.  All the patch is doing is deferring the call to
> maybeMergeSegments so it only does it every 2000 iterations (2000 is
> totally arbitrary on my part).
>
> I've verified with Luke that the index generated contains the same #
> documents, and same # terms, but I have not had a chance to properly set up
> my local environment to run the test cases.
>
> Obviously the attached patch is a dirty hack of the highest order. In my
> case I'm re-indexing from scratch every time, so there may be a reason why
> we shouldn't be doing this sort of deferring of method calls.  Perhaps the
> source code is optimized around incremental/batch updates to _existing_
> indexes, with the penalty that creating a new index performs slower than
> one would like.
>
> Perhaps IndexWriter could benefit from another setting that lets one
> configure how often to call maybeMergeSegments()?  That could of course
> confuse more people than it helps.
>
> I would really appreciate anyone's thoughts on this. I'll be very happy to be
> proven wrong because it will just help me understand more of Lucene.  I
> would hope that speeding up indexing would benefit everyone?  Particularly
> the large scale sites out there.
>
> cheers,
>
> Paul Smith
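
For reference, the deferral described in the quoted message (run the merge check only every N adds) might look roughly like the sketch below. The attached patch is not reproduced here, so everything in it is an assumption: DeferredMergeCheck, checkInterval, and close() are hypothetical names, and maybeMergeSegments() is a stub that only counts how often it runs.

```java
/** Hypothetical illustration of running an expensive check only every N adds. */
class DeferredMergeCheck {
    private final int checkInterval; // e.g. 2000, the arbitrary value from the message
    private int addsSinceCheck = 0;
    private int checksRun = 0;

    DeferredMergeCheck(int checkInterval) {
        this.checkInterval = checkInterval;
    }

    void addDocument() {
        addsSinceCheck++;
        if (addsSinceCheck >= checkInterval) {
            maybeMergeSegments();
            addsSinceCheck = 0;
        }
    }

    /** Run one final check so adds since the last interval are not skipped. */
    void close() {
        if (addsSinceCheck > 0) {
            maybeMergeSegments();
            addsSinceCheck = 0;
        }
    }

    /** Stub standing in for the real segment traversal; it only counts invocations. */
    private void maybeMergeSegments() {
        checksRun++;
    }

    int checksRun() {
        return checksRun;
    }
}
```

With an interval of 2000, adding 5000 documents runs the check twice during indexing and once more on close(), instead of 5000 times.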