lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge
Date Thu, 16 Aug 2007 09:19:31 GMT


Michael McCandless commented on LUCENE-845:

> This increases file descriptor usage in some cases, right? In the
> old scheme, if you set mergeFactor to 10 and maxBufferedDocs to
> 1000, you'd only get 10 segments with size <= 1000. But with this
> code, you can't bound that anymore. If I create single doc segments
> (perhaps by flushing based on latency), I can get 30 of them?

Right, the number of segments allowed in the index will be higher than
with the current merge policy if you consistently flush with [far]
fewer docs than maxBufferedDocs is set to.

But this is actually the essence of the bug.  The case we're trying
to fix is where you set maxBufferedDocs to something really large (say
1,000,000) to avoid flushing by doc count, and you call
setRAMBufferSizeMB with something like 32 MB.  In this case, the
current merge policy would just keep merging any set of 10 segments
with < 1,000,000 docs each, such that eventually all your indexing
time is spent doing highly sub-optimal merges.
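
To make the over-merging concrete, here is a minimal sketch, assuming
a simplified model in which a segment's level is inferred purely from
its doc count relative to maxBufferedDocs.  The class and method names
(DocCountLevelSketch, levelOf) and the figure of 5,000 docs per flush
are hypothetical; this is not the actual IndexWriter code.

```java
// Simplified model of doc-count-based level inference (an assumption,
// not Lucene's real merge policy implementation).
final class DocCountLevelSketch {

    /**
     * Level 0 holds segments with up to maxBufferedDocs docs, level 1 up to
     * maxBufferedDocs * mergeFactor docs, and so on.
     */
    static int levelOf(long docCount, long maxBufferedDocs, int mergeFactor) {
        int level = 0;
        long upperBound = maxBufferedDocs;
        while (docCount > upperBound) {
            upperBound *= mergeFactor;
            level++;
        }
        return level;
    }

    public static void main(String[] args) {
        long maxBufferedDocs = 1_000_000; // set very high to avoid flushing by doc count
        int mergeFactor = 10;
        long flushedDocs = 5_000;         // hypothetical docs per RAM-triggered flush
        long mergedDocs = flushedDocs * mergeFactor;

        // Both the freshly flushed segment and the result of merging ten of
        // them are still classified as level 0, so the merged segment stays
        // eligible for yet another level-0 merge.
        System.out.println(levelOf(flushedDocs, maxBufferedDocs, mergeFactor));
        System.out.println(levelOf(mergedDocs, maxBufferedDocs, mergeFactor));
    }
}
```

In this model both calls print 0: the merged 50,000-doc segment looks
no different from a freshly flushed one, which is why the same data
keeps getting re-merged.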

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>                 Key: LUCENE-845
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is that, to work around this bug, I think you just
> need to ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.
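
The flush-by-RAM pattern and the workaround condition described in the
issue text can be sketched as follows.  This is a rough illustration
only: BufferingWriter and FakeWriter are hypothetical stand-ins that
borrow the method names mentioned above (flush(), ramSizeInBytes()),
and the fake's RAM accounting (one byte per character) is an
assumption made purely for the example.

```java
import java.util.Arrays;
import java.util.List;

final class FlushByRamSketch {

    /** Hypothetical stand-in for the writer methods named in the issue text. */
    interface BufferingWriter {
        void addDocument(String doc);
        long ramSizeInBytes();
        void flush();
    }

    /** Toy writer: counts one byte of buffered RAM per character added. */
    static final class FakeWriter implements BufferingWriter {
        private long buffered = 0;
        public void addDocument(String doc) { buffered += doc.length(); }
        public long ramSizeInBytes() { return buffered; }
        public void flush() { buffered = 0; }
    }

    /** Adds every doc, flushing whenever buffered RAM reaches the budget. */
    static int indexAll(BufferingWriter writer, List<String> docs, long maxRamBytes) {
        int flushes = 0;
        for (String doc : docs) {
            writer.addDocument(doc);
            if (writer.ramSizeInBytes() >= maxRamBytes) {
                writer.flush();
                flushes++;
            }
        }
        return flushes;
    }

    /** Workaround condition: maxBufferedDocs < mergeFactor * typical docs per flush. */
    static boolean satisfiesWorkaround(long maxBufferedDocs, int mergeFactor,
                                       long typicalDocsFlushed) {
        return maxBufferedDocs < mergeFactor * typicalDocsFlushed;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("aaaa", "bbbb", "cccc", "dddd", "ee");
        System.out.println(indexAll(new FakeWriter(), docs, 8));
        System.out.println(satisfiesWorkaround(1_000_000, 10, 5_000));
        System.out.println(satisfiesWorkaround(1_000, 10, 500));
    }
}
```

With a hypothetical 5,000 docs per flush and mergeFactor 10, a
maxBufferedDocs of 1,000,000 fails the check, while any value below
50,000 would pass it.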

