lucene-dev mailing list archives

From "Steven Parkes (JIRA)" <>
Subject [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge
Date Mon, 30 Apr 2007 23:26:15 GMT


Steven Parkes commented on LUCENE-845:

Following up on this, is the basic idea that segments ought to be created and merged
either by segment size or by doc count, but not by a mixture of the two? That wouldn't be surprising ...

It does impact the APIs, though. It's easy enough to imagine, with factored merge policies,
both by-doc-count and by-segment-size policies. But the initial segment creation is going
to be handled by IndexWriter, so you have to manually make sure you don't set that algorithm
and the merge policy in conflict. Not great, but I don't have any great ideas. Could put in
an API handshake, but I'm not sure if it's worth the mess?

Also, it sounds like, so far, there's no good way of managing parallel-reader setups w/by-segment-size
algorithms, since the algorithm for creating/merging segments has to be globally consistent,
not just per index, right?

If that is right, what does that say about making by-segment-size the default? It's going to
break (as in produce bad results for) people who rely on that behavior and don't change their code. Is
there a community consensus on this? It's not really an API change that would cause a compile/class-load
failure, but in some ways, it's worse ...

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>                 Key: LUCENE-845
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is that, to work around this bug, I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.
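
As a rough illustration of the level inference and the workaround described in the quoted issue, here is a minimal Java sketch. It is a simplified model, not Lucene's actual IndexWriter code: the class MergeLevelSketch and the methods level and workaroundHolds are hypothetical names, and the sketch assumes that level 0 covers segments with at most maxBufferedDocs documents and that each higher level's upper bound is mergeFactor times the previous one.

```java
// Simplified model of doc-count-based merge levels; NOT Lucene's IndexWriter code.
// All names here are hypothetical and exist only for illustration.
final class MergeLevelSketch {

    // Assumed rule: level 0 holds segments with at most maxBufferedDocs docs,
    // and each higher level's upper bound is mergeFactor times the previous one.
    static int level(int docCount, int maxBufferedDocs, int mergeFactor) {
        if (maxBufferedDocs < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException(
                "need maxBufferedDocs >= 1 and mergeFactor >= 2");
        }
        int level = 0;
        long upperBound = maxBufferedDocs; // long avoids int overflow while scaling
        while (docCount > upperBound) {
            upperBound *= mergeFactor;
            level++;
        }
        return level;
    }

    // The workaround from the issue text:
    // maxBufferedDocs < mergeFactor * typical-number-of-docs-flushed.
    static boolean workaroundHolds(int maxBufferedDocs, int mergeFactor,
                                   int typicalDocsFlushed) {
        return maxBufferedDocs < (long) mergeFactor * typicalDocsFlushed;
    }

    public static void main(String[] args) {
        int mergeFactor = 10;
        int typicalDocsFlushed = 100; // docs per RAM-triggered flush (made-up figure)
        int mergedDocs = mergeFactor * typicalDocsFlushed; // one merge of flushed segments

        for (int maxBufferedDocs : new int[] {10000, 999}) {
            System.out.println("maxBufferedDocs=" + maxBufferedDocs
                + " workaroundHolds="
                + workaroundHolds(maxBufferedDocs, mergeFactor, typicalDocsFlushed)
                + " levelOfMergedSegment="
                + level(mergedDocs, maxBufferedDocs, mergeFactor));
        }
    }
}
```

Under these assumptions, with maxBufferedDocs set to 10000, a segment merged from ten 100-document flushes is still classified as level 0, so it is eligible to be merged again alongside freshly flushed segments. That is the over-merging the issue describes. With maxBufferedDocs set to 999, the same merged segment lands in level 1.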

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
