lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-856) Optimize segment merging
Date Mon, 02 Jul 2007 14:16:05 GMT


Michael McCandless commented on LUCENE-856:

I ran a new performance comparison here to test the merging cost of
autoCommit=false vs true, this time using Wikipedia content.

I indexed all of Wikipedia using the patch from LUCENE-843 and the
patch from LUCENE-947, once with autoCommit=true and once with
autoCommit=false.  I used this alg (and just changed autocommit=true
to false for the second test):

    [{AddDoc}: *] : 4

    RepSumByPref AddDoc

Which means: use 4 threads to index all text from each of the 3.2
million Wikipedia docs, with stored fields & term vectors turned on,
using SimpleAnalyzer, flushing when RAM usage hits 32 MB.
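A contrib/benchmark alg along these lines would look roughly like the
sketch below; the property names are standard benchmark settings, but
the file path and doc-maker class here are assumptions, not the exact
values from the original run:

    analyzer=org.apache.lucene.analysis.SimpleAnalyzer
    doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
    docs.file=/path/to/wikipedia.lines.txt
    doc.stored=true
    doc.term.vector=true
    ram.flush.mb=32
    autocommit=true

    ResetSystemErase
    CreateIndex
    [{AddDoc}: *] : 4
    CloseIndex

    RepSumByPref AddDoc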

The index size is 20 GB.

Report from autoCommit=true:

    ------------> Report Sum By Prefix (AddDoc) (1 about 3204066 out of 3204073)
    Operation   round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
    AddDoc          0  3204066            1        226.3   14,159.22   282,843,296    373,480,960

    Net elapsed time = 87 minutes 18 seconds

Report from autoCommit=false:

    ------------> Report Sum By Prefix (AddDoc) (1 about 3204066 out of 3204073)
    Operation   round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
    AddDoc          0  3204066            1        407.6    7,860.63   252,046,000    329,962,048

    Net elapsed time = 60 minutes 5 seconds

Some comments:

  * According to net elapsed time, autoCommit=false is 31% faster than
    autoCommit=true.

  * According to "rec/s" it's actually 44% faster; this is because
    rec/s only measures the actual addDocument time and not, e.g., the IO
    cost of retrieving the document contents.

  * The speedup is due entirely to the fact that the "doc stores"
    (vectors & stored fields) do not need to be merged when
    autoCommit=false.  This is a major win because these files are
    enormous if you turn on stored fields & term vectors with offsets
    & positions.

  * The basic conclusion is the same as before: if you want to build
    up a large index and don't need to search it while it's being
    built, the fastest way is the LUCENE-843 patch with
    autoCommit=false.
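As a sanity check on the two percentages above, here is a small bit of
arithmetic (plain Python, nothing Lucene-specific): the 31% figure is
the reduction in net elapsed wall-clock time, while the 44% figure
treats the rec/s gain as a reduction in per-document add time.

```python
# Net elapsed times reported above.
true_secs = 87 * 60 + 18    # autoCommit=true:  87 min 18 s
false_secs = 60 * 60 + 5    # autoCommit=false: 60 min 5 s

# "31% faster": reduction in wall-clock time.
wall_clock_speedup = (true_secs - false_secs) / true_secs
print(round(wall_clock_speedup * 100, 1))   # prints 31.2

# "44% faster": rec/s measures only addDocument time; per-doc
# time is 1/rec_per_s, so the reduction in per-doc time is:
per_doc_reduction = 1 - 226.3 / 407.6
print(round(per_doc_reduction * 100, 1))    # prints 44.5
```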

> Optimize segment merging
> ------------------------
>                 Key: LUCENE-856
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
> With LUCENE-843, the time spent indexing documents has been
> substantially reduced and now the time spent merging is a sizable
> portion of indexing time.
> I ran a test using the patch for LUCENE-843, building an index of 10
> million docs, each with ~5,500 byte plain text, with term vectors
> (positions + offsets) on and with 2 small stored fields per document.
> RAM buffer size was 32 MB.  I didn't optimize the index in the end,
> though optimize speed would also improve if we optimize segment
> merging.  Index size is 86 GB.
> Total time to build the index was 8 hrs 38 minutes, 5 hrs 40 minutes
> of which was spent merging.  That's 65.6% of the time!
> Most of this time is presumably IO which probably can't be reduced
> much unless we improve overall merge policy and experiment with values
> for mergeFactor / buffer size.
> These tests were run on a Mac Pro with 2 dual-core Intel CPUs.  The IO
> system is RAID 0 of 4 drives, so, these times are probably better than
> the more common case of a single hard drive which would likely be
> slower IO.
> I think there are some simple things we could do to speed up merging:
>   * Experiment with buffer sizes -- maybe larger buffers for the
>     IndexInputs used during merging could help?  Because at a default
>     mergeFactor of 10, the disk heads must do a lot of seeking back and
>     forth between these 10 files (and then to the 11th file where we
>     are writing).
>   * Use byte copying when possible, e.g. if there are no deletions on a
>     segment we can almost (I think?) just copy things like prox
>     postings, stored fields, term vectors, instead of full parsing to
>     Java objects and then re-serializing them.
>   * Experiment with mergeFactor / different merge policies.  For
>     example I think LUCENE-854 would reduce time spent merging for a
>     given index size.
> This is currently just a place to list ideas for optimizing segment
> merges.  I don't plan on working on this until after LUCENE-843.
> Note that for "autoCommit=false", this optimization is somewhat less
> important, depending on how often you actually close/open a new
> IndexWriter.  In the extreme case, if you open a writer, add 100 MM
> docs, close the writer, then no segment merges happen at all.
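The mergeFactor point in the quoted description is at heart a k-way
merge: with mergeFactor=10, sorted postings from 10 segment files are
interleaved into one output, which is what forces the head to seek back
and forth across the inputs. A generic sketch of that streaming pattern
(plain Python, not Lucene code; the segment contents here are made up):

```python
import heapq

# Ten "segments", each a sorted run of (term, doc_id) postings.
# With mergeFactor=10 the merger reads from all ten inputs while
# writing the merged output -- the source of the seeking described
# in the quoted text.
segments = [
    sorted((f"term{t}", doc) for t in range(3) for doc in range(i, 30, 10))
    for i in range(10)
]

# heapq.merge lazily pulls the smallest head among all runs, the
# same streaming pattern a segment merger uses.
merged = list(heapq.merge(*segments))

assert merged == sorted(merged)                     # one sorted run out
assert len(merged) == sum(len(s) for s in segments)  # nothing dropped
```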

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
