lucene-dev mailing list archives

From "Michael McCandless" <>
Subject RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Date Thu, 22 Mar 2007 22:18:56 GMT

On Thu, 22 Mar 2007 13:34:39 -0700, "Steven Parkes" <> said:
> > EG if you set maxBufferedDocs to say 10000 but then it turns out based
> > on RAM usage you actually flush every 300 docs then the merge policy
> > will incorrectly merge a level 1 segment (with 3000 docs) in with the
> > level 0 segments (with 300 docs).  This is because the merge policy
> > looks at the current value of maxBufferedDocs to compute the levels
> > so a 3000 doc segment and a 300 doc segment all look like "level 0".
> Are you calling the 3K segment a level 1 segment because it was created
> from level 0 segments? Because based on size, it is a level 0 segment,
> right? With the current merge policy, you can merge level n segments and
> get a level n segment. Deletes will do this, plus other things like
> changing merge policy parameters and combining indexes.

Right, I'm calling a newly created segment (i.e. flushed from RAM) level
0, and then a level 1 segment is created when you merge 10 level 0
segments, level 2 is created when you merge 10 level 1 segments, etc.

> Because based on size, it is a level 0 segment, right?

Well, I don't think it's right to call something level 0 just because
it's under the current maxBufferedDocs.
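
To make the mismatch concrete, here is a rough sketch of a size-based
level computation. It is a simplification written for this message, not
the actual IndexWriter code: the class and method names are hypothetical,
and it only assumes that level 0 is everything up to maxBufferedDocs and
that each further level is mergeFactor times larger.

```java
// Hypothetical sketch, not Lucene's actual merge code.
class SegmentLevels {
    // Assumed rule: level 0 holds segments of up to maxBufferedDocs docs,
    // and each higher level holds segments up to mergeFactor times larger.
    static int levelBySize(long docCount, long maxBufferedDocs, int mergeFactor) {
        if (maxBufferedDocs < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("need maxBufferedDocs >= 1 and mergeFactor >= 2");
        }
        int level = 0;
        long upperBound = maxBufferedDocs;
        while (docCount > upperBound) {
            upperBound *= mergeFactor;
            level++;
        }
        return level;
    }

    public static void main(String[] args) {
        // With maxBufferedDocs=10000 both segments land in the same level.
        System.out.println(levelBySize(300, 10000, 10));
        System.out.println(levelBySize(3000, 10000, 10));
        // With the effective flush size (300) they are one level apart.
        System.out.println(levelBySize(300, 300, 10));
        System.out.println(levelBySize(3000, 300, 10));
    }
}
```

Under these assumptions the 300 doc and 3000 doc segments are
indistinguishable when the levels are derived from maxBufferedDocs=10000,
but sit one level apart when derived from the real flush size of 300.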

Backing up a bit ... I think the lowest amortized cost merge policy
should always try to merge roughly equal sized segments subject to
restrictions of 1) max # segments that can be merged at once
(mergeFactor) presumably due to file descriptor limits and/or
substantial degradation in merge performance as mergeFactor increases
eg due to lack of concurrency in IO system (??) and 2) that you must
merge adjacent segments I think (so docIDs, though changing, remain
monotonic).

Actually is #2 a hard requirement?  Do the loose ports of Lucene
(KinoSearch, Ferret, etc.) also follow this restriction?  We say that
developers should not rely on docIDs but people still seem to rely on
their monotonic ordering (even though they change).

Merging is costly because you read all data in then write all data
out, so you want to minimize, for each byte of data in the index,
how many times it will be "serviced" (read in, written out) as
part of a merge.  I think if N equal sized segments are always merged
then the # copies for each byte of data will be minimized?
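
As a rough way to count the copies, the toy simulation below flushes
fixed-size segments and merges whenever the newest mergeFactor segments
are all the same size. The class and method names are made up for
illustration, it counts docs rather than bytes, and it ignores deletes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, not Lucene code: counts how many docs get rewritten by merges.
class MergeCostSim {
    static long docsCopied(int flushes, int docsPerFlush, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        List<Long> segments = new ArrayList<>();
        long copied = 0;
        for (int i = 0; i < flushes; i++) {
            segments.add((long) docsPerFlush);
            // Cascade: keep merging while the newest mergeFactor segments
            // are all equal in size.
            while (tailIsEqualSized(segments, mergeFactor)) {
                long merged = 0;
                for (int j = 0; j < mergeFactor; j++) {
                    merged += segments.remove(segments.size() - 1);
                }
                segments.add(merged);
                copied += merged; // every doc in the merge is written once
            }
        }
        return copied;
    }

    static boolean tailIsEqualSized(List<Long> segments, int mergeFactor) {
        int n = segments.size();
        if (n < mergeFactor) {
            return false;
        }
        long size = segments.get(n - 1);
        for (int j = n - mergeFactor; j < n; j++) {
            if (segments.get(j) != size) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 100 flushes of 300 docs, mergeFactor 10: prints 60000,
        // i.e. each of the 30000 docs is copied twice.
        System.out.println(docsCopied(100, 300, 10));
    }
}
```

In this model 100 flushes of 300 docs with mergeFactor=10 rewrite 60000
docs in total, so each doc is copied twice, which matches
log10(30000/300) = 2 levels of equal-sized merges.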

So, the fact that due to this bug we will merge a 3000 doc segment
with 9 300 doc segments is not efficient (amortized) because those
3000 docs in the first segment will net/net have to get merged again
far sooner than they would have had they been merged with 9 3000 doc
segments.

I think instead of calling segments "level N" we should just measure
their net sizes and merge on that basis?
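
One possible shape for a size-based selection, purely as a sketch: scan
for a run of mergeFactor adjacent segments whose sizes are within some
ratio of each other. The maxSizeRatio parameter and all names here are
assumptions for illustration, not anything in IndexWriter today.

```java
// Hypothetical sketch of picking a merge by measured size, not by "level".
class SizeBasedMergeSelector {
    // sizes are listed oldest segment first. Returns the index of the first
    // run of mergeFactor adjacent segments whose largest/smallest size ratio
    // is at most maxSizeRatio, or -1 if there is no such run.
    static int findMergeStart(long[] sizes, int mergeFactor, double maxSizeRatio) {
        for (int start = 0; start + mergeFactor <= sizes.length; start++) {
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (int i = start; i < start + mergeFactor; i++) {
                min = Math.min(min, sizes[i]);
                max = Math.max(max, sizes[i]);
            }
            if (min > 0 && (double) max / min <= maxSizeRatio) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] mixed = {3000, 300, 300, 300, 300, 300, 300, 300, 300, 300};
        // One 3000 doc segment plus nine 300 doc segments: prints -1 (no merge yet).
        System.out.println(findMergeStart(mixed, 10, 2.0));
    }
}
```

With this rule the 3000 doc segment followed by nine 300 doc segments is
left alone, and a merge is only picked once a tenth 300 doc segment
arrives, covering just the ten small segments. It also only ever picks
adjacent segments, so restriction #2 still holds.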

> Leads to the question of what is "over merging". The current merge
> policy doesn't consider the size of the result, it simply counts the
> number of segments at a level. Do you think this qualifies as over
> merging? It still should only merge when there are mergeFactor segments
> at a level, so you shouldn't be doing too terribly much merging.  And
> you have to be careful not to do less, right? By bounding the number of
> segments at each level, you ensure that your file descriptor usage only
> grows logarithmically.

Yes, at no time should you merge more than mergeFactor segments at once.

