lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-856) Optimize segment merging
Date Thu, 05 Apr 2007 16:43:32 GMT


Michael McCandless commented on LUCENE-856:

OK I re-ran the above test (10 MM docs @ ~5,500 bytes plain text each)
with autoCommit=false: this time it took 5 hrs 7 minutes, which is
40.7% faster than the autoCommit=true test above.
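
The percentages quoted in this thread can be reproduced from the reported
wall-clock times. The sketch below is only a sanity check of that
arithmetic; the class and method names are made up for illustration and are
not part of Lucene. Note that "40.7% faster" is measured as a reduction in
elapsed time (8 hrs 38 min down to 5 hrs 7 min), not as an increase in
throughput.

```java
import java.util.Locale;

public class IndexTimings {
    // Convert an "H hrs M minutes" figure to total minutes.
    static int minutes(int hours, int mins) {
        return hours * 60 + mins;
    }

    // Percentage by which 'after' is shorter than 'before'.
    static double percentLess(int before, int after) {
        return 100.0 * (before - after) / before;
    }

    // Percentage that 'part' represents of 'total'.
    static double percentOf(int part, int total) {
        return 100.0 * part / total;
    }

    public static void main(String[] args) {
        int autoCommitTrue = minutes(8, 38);   // total build time, autoCommit=true
        int autoCommitFalse = minutes(5, 7);   // total build time, autoCommit=false
        int merging = minutes(5, 40);          // merge time in the autoCommit=true run

        System.out.println(String.format(Locale.ROOT,
                "autoCommit=false took %.1f%% less time",
                percentLess(autoCommitTrue, autoCommitFalse)));
        System.out.println(String.format(Locale.ROOT,
                "merging was %.1f%% of the autoCommit=true run",
                percentOf(merging, autoCommitTrue)));
    }
}
```

Rounded to one decimal place, the two results match the 40.7% and 65.6%
figures reported in this issue.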

Both of these tests were run with the patch from LUCENE-843.

So this means, if all you need to do is build a massive index with
term vector positions & offsets, the fastest way to do so is with the
patch from LUCENE-843 and autoCommit=false on your writer.

Basically LUCENE-843 makes autoCommit=false quite a bit faster for a
very large index, assuming you are storing term vectors / stored fields.

Still, I think optimizing segment merging is important because for
many uses of Lucene, the "interactivity" (how quickly a searcher sees
the recently indexed documents) is very important.  For such cases you
should open a writer with autoCommit=false and then periodically close
& re-open it to publish the indexed documents to the searchers.  With
that model, segment merging will still be a factor slowing down indexing
(though how much of a factor depends on how often you close/open
your writers).

> Optimize segment merging
> ------------------------
>                 Key: LUCENE-856
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> With LUCENE-843, the time spent indexing documents has been
> substantially reduced and now the time spent merging is a sizable
> portion of indexing time.
> I ran a test using the patch for LUCENE-843, building an index of 10
> million docs, each with ~5,500 bytes of plain text, with term vectors
> (positions + offsets) on and with 2 small stored fields per document.
> RAM buffer size was 32 MB.  I didn't optimize the index in the end,
> though optimize speed would also improve if we optimize segment
> merging.  Index size is 86 GB.
> Total time to build the index was 8 hrs 38 minutes, 5 hrs 40 minutes
> of which was spent merging.  That's 65.6% of the time!
> Most of this time is presumably IO which probably can't be reduced
> much unless we improve overall merge policy and experiment with values
> for mergeFactor / buffer size.
> These tests were run on a Mac Pro with 2 dual-core Intel CPUs.  The IO
> system is RAID 0 of 4 drives, so, these times are probably better than
> the more common case of a single hard drive which would likely be
> slower IO.
> I think there are some simple things we could do to speed up merging:
>   * Experiment with buffer sizes -- maybe larger buffers for the
>     IndexInputs used during merging could help?  Because at a default
>     mergeFactor of 10, the disk heads must do a lot of seeking back and
>     forth between these 10 files (and then to the 11th file where we
>     are writing).
>   * Use byte copying when possible, eg if there are no deletions on a
>     segment we can almost (I think?) just copy things like prox
>     postings, stored fields, term vectors, instead of full parsing to
>     Java objects and then re-serializing them.
>   * Experiment with mergeFactor / different merge policies.  For
>     example I think LUCENE-854 would reduce time spent merging for a
>     given index size.
> This is currently just a place to list ideas for optimizing segment
> merges.  I don't plan on working on this until after LUCENE-843.
> Note that for "autoCommit=false", this optimization is somewhat less
> important, depending on how often you actually close/open a new
> IndexWriter.  In the extreme case, if you open a writer, add 100 MM
> docs, close the writer, then no segment merges happen at all.
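
To get a rough feel for how mergeFactor affects the amount of data
rewritten by merging, here is a toy model. It is a sketch under stated
assumptions, not Lucene's actual merge policy: every flush is assumed to
produce a segment of size 1, and whenever mergeFactor segments share a
level they are merged into one segment at the next level. It ignores IO
buffering, seeks, deletions and differing segment sizes. The class and
method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeCostModel {
    /**
     * Toy model (an assumption, not Lucene's real policy): each flush adds
     * one level-0 segment of size 1. When mergeFactor segments exist at a
     * level, they are merged into a single segment at the next level.
     * Returns the total number of size units written by all merges.
     */
    static long mergeUnitsWritten(int flushes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        List<Integer> countAtLevel = new ArrayList<>(); // segments per level
        long written = 0;
        for (int i = 0; i < flushes; i++) {
            int level = 0;
            long size = 1; // size of the segment being added at 'level'
            while (true) {
                if (countAtLevel.size() <= level) {
                    countAtLevel.add(0);
                }
                int count = countAtLevel.get(level) + 1;
                if (count < mergeFactor) {
                    countAtLevel.set(level, count);
                    break; // no merge triggered at this level
                }
                // This level is full: merge its segments into one larger segment.
                countAtLevel.set(level, 0);
                size *= mergeFactor;
                written += size; // the merged segment is written in full
                level++;         // and becomes a new segment one level up
            }
        }
        return written;
    }

    public static void main(String[] args) {
        int flushes = 100;
        for (int mergeFactor : new int[] {2, 10, 100}) {
            System.out.println("mergeFactor=" + mergeFactor
                    + " units written by merges="
                    + mergeUnitsWritten(flushes, mergeFactor));
        }
    }
}
```

In this model a larger mergeFactor rewrites less data in total, but each
merge reads from more input segments at once, which is the seeking cost
described in the first bullet of the issue. Real behavior would need to be
measured rather than inferred from a model like this.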

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
