lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge
Date Thu, 16 Aug 2007 18:29:31 GMT


Michael McCandless commented on LUCENE-845:

> Is there a change in filedescriptor use if you don't use setRamBufferSizeMB?

Yes.  E.g., if you set maxBufferedDocs to 1000 but then flush after
every added doc, and you add 1000 docs, then with the current merge
policy every 10 flushes will merge all segments together.  I.e., the
merged segment has 10 docs at first, then 20, 30, 40, 50, ..., 1000.
This is where the O(N^2) cost on merging comes from.  But you will
never have more than 10 segments in your index.
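
To make the quadratic cost concrete, here is a minimal sketch of the
behaviour described above.  It is a toy model, not Lucene code: the
function name and the "docs copied" cost measure are assumptions made
for illustration, and the only rule modelled is "flush one doc per
segment, merge everything every mergeFactor flushes".

```python
def simulate_merge_all(num_docs, merge_factor):
    """Toy model: one doc per flush, merge ALL segments every merge_factor flushes.

    Returns (docs_copied, max_segments, merged_sizes), where max_segments is
    measured after each flush-and-merge step.
    """
    segments = []      # doc counts of the live segments
    docs_copied = 0    # total docs rewritten by merges (the merge cost)
    max_segments = 0
    merged_sizes = []  # size of the segment produced by each merge
    for flush in range(1, num_docs + 1):
        segments.append(1)
        if flush % merge_factor == 0:
            total = sum(segments)
            docs_copied += total
            merged_sizes.append(total)
            segments = [total]
        max_segments = max(max_segments, len(segments))
    return docs_copied, max_segments, merged_sizes


copied, max_segs, sizes = simulate_merge_all(1000, 10)
print(sizes[:5], sizes[-1])  # [10, 20, 30, 40, 50] 1000
print(copied, max_segs)      # 50500 10
```

In this model, doubling the input to 2000 docs roughly quadruples the
docs copied (201000 vs 50500), which is the O(N^2) growth, while the
segment count stays at 10 or fewer.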

Whereas the new merge policy will make levels (segments of size 100,
10, 1) and merge only segments from the same level together.  So merge
cost will be much less (not O(N^2)), but the maximum number of segments
in the index will be higher (up to 1 + (mergeFactor-1) * log_mergeFactor(numDocs)),
or 28 segments in this example (I think).
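
The level scheme and the segment bound can be checked with a similar
toy model.  This is a sketch under assumptions, not the patch itself:
each flush is taken to create a one-doc level-0 segment, mergeFactor
segments of one level are merged into a single segment of the next
level, and the segment count is sampled right after a flush, before
any merge runs.  The helper names are hypothetical, and segment_bound
uses an integer (floor) logarithm.

```python
def simulate_leveled(num_docs, merge_factor):
    """Toy model: merge only segments of the same level, cascading upward.

    Returns (docs_copied, peak_segments, counts), where counts[i] is the
    number of live segments at level i after the last doc.
    """
    counts = [0]     # counts[i] = live segments at level i
    docs_copied = 0  # total docs rewritten by merges
    peak = 0         # most segments seen right after a flush
    for _ in range(num_docs):
        counts[0] += 1
        peak = max(peak, sum(counts))
        level = 0
        while counts[level] == merge_factor:
            # merge_factor segments of merge_factor**level docs each
            docs_copied += merge_factor ** (level + 1)
            counts[level] = 0
            if level + 1 == len(counts):
                counts.append(0)
            counts[level + 1] += 1
            level += 1
    return docs_copied, peak, counts


def segment_bound(num_docs, merge_factor):
    """1 + (mergeFactor-1) * log_mergeFactor(numDocs), with an integer log."""
    levels, size = 0, 1
    while size * merge_factor <= num_docs:
        size *= merge_factor
        levels += 1
    return 1 + (merge_factor - 1) * levels


print(simulate_leveled(1000, 10))  # (3000, 28, [0, 0, 0, 1])
print(segment_bound(1000, 10))     # 28
```

In this model each doc is copied once per level (3000 docs copied in
total for 1000 docs), and the peak of 28 segments agrees with the
formula for mergeFactor 10 and 1000 docs.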

Basically the new merge policy tries to make levels "all the way
down" rather than forcefully stopping when the levels get smaller than
maxBufferedDocs, to avoid the O(N^2) merge cost.

> One solution to this would be in cases like this to merge the small
> segments to one but not include the big segments. So you get [1000
> 10] where the last segment keeps growing until it reaches 1000. This
> does more copies than the current case, but always on small
> segments, with the advantage of a lower bound on the number of file
> descriptors?

I'm not sure that helps?  Because that "small segment" will have to
grow bit by bit up to 1000 (causing the O(N^2) cost).

Note that the goal here is to be able to switch to flushing by RAM
buffer size instead of docCount (and also merge by byte-size of
segments not doc count), by default, in IndexWriter.  But, even once
we do that, if you always flush tiny segments the new merge policy
will still build levels "all the way down".

Here's an idea: maybe we can accept the O(N^2) merge cost, when the
segments are "small"?  Ie, maybe doing 100 sub-optimal merges (in the
example above) does not amount to that much actual cost in practice.
(After all nobody has complained about this :).

I will run some tests.  Clearly at some point the O(N^2) cost will
dominate your indexing time, but maybe we can set a "rough" docCount
below which all segments are counted as a single level without taking
too much of an indexing performance hit.
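
One way to explore that trade-off before running real tests is to
extend the toy model with a floor.  Everything here is an assumption
made for illustration (the names floor_docs, level_of and
simulate_with_floor, and the exact merge rule): segments smaller than
floor_docs all count as level 0, larger segments are leveled by size,
and any level holding mergeFactor segments is merged into one.

```python
def level_of(size, floor_docs, merge_factor):
    """Level 0 for anything under floor_docs, then one level per factor of merge_factor."""
    if size < floor_docs:
        return 0
    level, bound = 1, floor_docs * merge_factor
    while size >= bound:
        bound *= merge_factor
        level += 1
    return level


def simulate_with_floor(num_docs, merge_factor, floor_docs):
    """Toy model: one doc per flush; returns (docs_copied, peak_segments)."""
    segments = []    # doc counts of the live segments
    docs_copied = 0  # total docs rewritten by merges
    peak = 0         # most segments seen right after a flush
    for _ in range(num_docs):
        segments.append(1)
        peak = max(peak, len(segments))
        merged = True
        while merged:  # keep merging until no level is full
            merged = False
            by_level = {}
            for size in segments:
                lvl = level_of(size, floor_docs, merge_factor)
                by_level.setdefault(lvl, []).append(size)
            for lvl, sizes in by_level.items():
                if len(sizes) >= merge_factor:
                    total = sum(sizes)
                    docs_copied += total
                    segments = [s for s in segments
                                if level_of(s, floor_docs, merge_factor) != lvl]
                    segments.append(total)
                    merged = True
                    break
    return docs_copied, peak


print(simulate_with_floor(1000, 10, 100))  # (7050, 19)
print(simulate_with_floor(1000, 10, 1))    # (3000, 28)
```

With a floor of 100 docs the model copies 7050 docs and peaks at 19
segments, versus 3000 docs and 28 segments with no effective floor
(floor_docs=1).  These are properties of the toy model only, not
measurements of IndexWriter.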

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>                 Key: LUCENE-845
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is that to work around this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
