Subject: Re: flushRamSegments possible perf improvement?
To: java-dev@lucene.apache.org
From: Doron Cohen <DORONC@il.ibm.com>
Date: Wed, 18 Oct 2006 23:44:04 -0800

Ok, I tested this approach. The code is not clean yet, just enough to check
whether there is a potential improvement here, and I think there is.

Performance results for the (short) tests I ran on my everyday machine:

(read as: [oldTimeMillis] to [newTimeMillis] is [speed-up, i.e. percent
reduction in elapsed time] for adding: n docs, buf=maxBufferedDocs
mrg=mergeFactor)

--- "new" runs before "old" ---
3605 to 2964 is 17%  for: 500 docs, buf=10 mrg=3
2163 to 1923 is 11%  for: 2000 docs, buf=100 mrg=4
6990 to 5759 is 17%  for: 8000 docs, buf=200 mrg=5
20529 to 18286 is 10%  for: 32000 docs, buf=400 mrg=6
44444 to 39677 is 10%  for: 64000 docs, buf=1000 mrg=7

--- "old" runs before "new" ---
3926 to 2434 is 38%  for: 500 docs, buf=10 mrg=3
2233 to 1732 is 22%  for: 2000 docs, buf=100 mrg=4
6199 to 5678 is 8%  for: 8000 docs, buf=200 mrg=5
20139 to 16955 is 15%  for: 32000 docs, buf=400 mrg=6
42220 to 39507 is 6%  for: 64000 docs, buf=1000 mrg=7
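
The speed-up column above is the relative reduction in elapsed time,
truncated to a whole percent. A minimal sketch of that arithmetic follows;
the class and method names are hypothetical and this is not the actual
test code:

```java
final class SpeedUp {
    // Percent reduction in elapsed time, truncated toward zero
    // (integer division), which matches the figures in the tables.
    static long speedUpPercent(long oldTimeMillis, long newTimeMillis) {
        return (oldTimeMillis - newTimeMillis) * 100 / oldTimeMillis;
    }

    public static void main(String[] args) {
        // First row of each table: 500 docs, buf=10 mrg=3
        System.out.println(speedUpPercent(3605, 2964)); // prints 17
        System.out.println(speedUpPercent(3926, 2434)); // prints 38
    }
}
```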

I will submit this in a Jira issue.

Thoughts, anyone?
Are there any other particular settings you think should be tested?

- Doron

Doron Cohen/Haifa/IBM@IBMIL wrote on 18/10/2006 15:29:26:
>
> Currently IndexWriter.flushRamSegments() always merges all ram segments
> to disk. Later it may merge more, depending on the maybe-merge
> algorithm. This happens when closing the index and when the number of
> (1 doc) (ram) segments exceeds max-buffered-docs.
>
> Can there be a performance penalty for always merging to disk first?
>
> Assume the following merges take place:
>   merging segments _ram_0 (1 docs) _ram_1 (1 docs) ... _ram_N (1 docs)
>     into _a (N docs)
>   merging segments _a (N docs) _6 (M docs) _7 (K docs) _8 (L docs)
>     into _b (N+M+K+L docs)
>
> Alternatively, we could tell (compute) that this is going to happen, and
> have a single merge:
>   merging segments _ram_0 (1 docs) _ram_1 (1 docs) ... _ram_N (1 docs)
>                    _6 (M docs) _7 (K docs) _8 (L docs)
>     into _b (N+M+K+L docs)
>
> This would save writing the segment of size N to disk and reading it
> back again. For large enough N, is there really a potential saving here?
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
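
For the two merge scenarios quoted above, here is a back-of-the-envelope
way to see where the saving would come from. This is only a sketch: it
counts documents written to or read from disk as a rough proxy for I/O
(an assumption, since real cost depends on segment bytes and OS caching),
and the class and method names are hypothetical, not IndexWriter code.

```java
final class MergeCostSketch {
    // Two-step path: flush ram segments to _a, then merge _a with the
    // disk segments into _b. Returns docs written plus docs read on disk.
    static long twoStepDiskDocs(long n, long m, long k, long l) {
        long flushWrite = n;              // ram segments -> _a
        long mergeRead = n + m + k + l;   // _a, _6, _7, _8 read back
        long mergeWrite = n + m + k + l;  // result written as _b
        return flushWrite + mergeRead + mergeWrite;
    }

    // Single-merge path: ram segments are merged straight into _b,
    // so only the existing disk segments are read from disk.
    static long singleMergeDiskDocs(long n, long m, long k, long l) {
        long mergeRead = m + k + l;       // _6, _7, _8
        long mergeWrite = n + m + k + l;  // result written as _b
        return mergeRead + mergeWrite;
    }

    public static void main(String[] args) {
        long n = 1000, m = 1000, k = 1000, l = 1000;
        long twoStep = twoStepDiskDocs(n, m, k, l);
        long single = singleMergeDiskDocs(n, m, k, l);
        System.out.println(twoStep);          // prints 9000
        System.out.println(single);           // prints 7000
        System.out.println(twoStep - single); // prints 2000, i.e. 2 * n
    }
}
```

Under this model the single merge always avoids 2*N document transfers
(one write and one read of the size-N segment), so the relative gain
shrinks as M+K+L grows compared to N.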

