lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge
Date Thu, 16 Aug 2007 20:03:30 GMT


Michael McCandless commented on LUCENE-845:

> Or here's another random idea: maybe IndexReaders should load the
> tail of "small segments" into a RAMDirectory, for each one. Ie, an
> IndexReader is given a RAM buffer "budget" and it spends it on any
> numerous small segments in the index....?

Following up on this ... I think IndexReader could load the "small
tail segments" into a RAMDirectory and then merge them to make
search even faster.  This should typically be extremely fast if we set
the defaults right, and RAM usage should stay quite low since merging
small segments usually gives great compression in net bytes used.
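To make the "RAM budget" idea concrete, here is a minimal pure-Java sketch (the method name and shape are illustrative assumptions, not Lucene API): given segment sizes ordered largest to smallest, a reader walks the tail and loads small segments into RAM until the budget is spent.

```java
// Hypothetical sketch, not Lucene API: decide how many trailing "small"
// segments an IndexReader could load into a RAMDirectory under a fixed
// RAM budget.  segmentBytes is ordered largest-first, so the "tail"
// (smallest segments) sits at the end of the array.
static int tailSegmentsWithinBudget(long[] segmentBytes, long ramBudget) {
    long spent = 0;
    int count = 0;
    // Walk from the smallest segment upward, stopping once the next
    // segment would exceed the budget.
    for (int i = segmentBytes.length - 1; i >= 0; i--) {
        if (spent + segmentBytes[i] > ramBudget) break;
        spent += segmentBytes[i];
        count++;
    }
    return count;
}
```

Because small segments compress well when merged, even a modest budget would usually cover the whole tail.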

This would allow us to avoid (or at least minimize) the O(N^2) merge
cost of ensuring that an index is "at all instants" ready for a reader
to load directly.  It basically gives us "merge tail segments on
demand when a reader refreshes".
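To see where the O(N^2) cost comes from, here is a small illustrative sketch (an assumption for exposition, not Lucene code): if the writer re-merges the entire tail after every tiny flush so a reader can always open the index directly, the total bytes merged grow quadratically in the number of flushes.

```java
// Illustrative only: total merge work when the whole tail is re-merged
// after every flush, keeping the index "always ready" for a reader.
static long mergeWorkAlwaysReady(int flushes, long bytesPerFlush) {
    long total = 0;
    long tail = 0;
    for (int i = 0; i < flushes; i++) {
        tail += bytesPerFlush;   // a new tiny segment joins the tail
        total += tail;           // re-merging the whole tail costs its full size
    }
    return total;                // grows ~ flushes^2 / 2 * bytesPerFlush
}
```

Deferring the tail merge until a reader refreshes turns this repeated quadratic work into a single one-time cost per refresh.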

We could combine these two approaches, whereby the IndexWriter is
free to make use of a "long tail" of segments so it doesn't hit the
O(N^2) slowdown in merge cost, yet a reader pays only a small
(one-time) cost for such segments.

I think the combination of these two changes should give a sizable
net improvement for "low latency" apps, because IndexWriter is free
to make minuscule segments (even document by document) and
IndexReader (especially combined with LUCENE-743) can quickly
re-open, do a "mini-optimize" on the tail segments, and get
great performance.

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>                 Key: LUCENE-845
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.
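The "infer level from net size" idea floated in the description above can be sketched as follows (a hypothetical helper, not Lucene's merge policy): since merging mergeFactor segments multiplies segment size by roughly mergeFactor, a segment's level can be estimated from its byte size relative to the typical flushed-segment size.

```java
// Hypothetical sketch of inferring a segment's merge level from its net
// byte size, as suggested in the issue.  typicalFlushBytes is the size of
// a freshly flushed (level 0) segment; each merge of mergeFactor segments
// multiplies size by roughly mergeFactor.  Names are illustrative, not
// Lucene API.
static int inferLevel(long segmentBytes, long typicalFlushBytes, int mergeFactor) {
    int level = 0;
    long threshold = typicalFlushBytes;
    // Climb one level each time the segment is at least mergeFactor
    // times the current level's typical size.
    while (segmentBytes >= threshold * mergeFactor) {
        threshold *= mergeFactor;
        level++;
    }
    return level;
}
```

Basing levels on bytes rather than on maxBufferedDocs would make the policy robust to flush-by-RAM, though as the issue notes it would still need to tolerate the application raising the RAM limit mid-stream.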

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

