lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson (JIRA)" <>
Subject [jira] [Commented] (LUCENE-7976) Make TieredMergePolicy respect maxSegmentSizeMB and allow singleton merges of very large segments
Date Fri, 25 May 2018 00:40:00 GMT


Erick Erickson commented on LUCENE-7976:

[~mikemccand] OK, I'm back at it.

------------easy parts.
I'll incorporate all your style suggestions, NP there. Except for twoMayHaveBeenMerged. That's
not a typo, it's meant to cover the case where there _could_ be one tiny segment and findForcedMerges
legitimately combines that small segment with a larger one even if there are no deletions
while still respecting maxMergedSegmentSizeMB. I've added a bit more explanation in the comments
though, if it confused you it would confuse others. Maybe I'll be able to think of a better

bq. Can you change doFindMerges to not modify the incoming list...
I've done that. Pending the resolution to maxSegments, holdInCase may not be necessary though
in which case copying to a local variable is unnecessary. I'm not sure I care given how expensive
merging is in general, one extra list copy is probably immeasurable (and cleaner).

------hard part 1
bq. These parts still seem smelly to me:
Yeah, me too.

bq. Or, maybe we can change the problem: when you forceMerge to N segments, should we really
allow violating the maxMergedSegmentMB constraint in that case?

Frankly I'd like to have exactly one valid value for maxSegments: 1 . That is, the only choices
would be to merge down to 1 segment or respect maxMergedSegmentMB. WDYT?

Another option is to relax what specifying maxSegments means to a "best effort". If we just
calculate the max segment size based on the number of segments specified and not worry about
meeting it exactly. Then the loop goes away. In that case, the 25% (or whatever) fudge factor
isn't dangerous since there'd be no loop, but fudging it up isn't totally necessary either.

I do wonder who, if anybody, uses maxSegments except to avoid the issue that started this,
that is optimize creating a large segment that can accumulate lots of deleted docs. Has it
outlived its usefulness?

---- hard part 2
bq. And I think that's the line you were asking me about with this?
bq. But I think that's dangerous – won't this mean that we will fail to return a merge that
could concurrently run with already merging segments, in the cascading case?

So the scenario here is that we have a long-running merge running on one thread and, even
though we could be running a dozen other small merges we wouldn't run any of them until the
big one was done, right?

OK, I'll put this back.

--------------- TBD then:
So I think the only thing remaining from your review is what to do about maxSegments and that
really horrible loop. Options in the order I like ;)

1> Remove the loop entirely. Calculate the theoretical segment size and add 25% (or whatever)
and call doFindMerges once. The purpose of the loop is to handle the case where that calculations
results in N+M segments when N was specified but M additional segments were created because
they didn't pack well (your bin packing characterization). Update the docs to say it's on
a "best effort" basis.

This option preserves most of the intent of maxSegments without introducing the complexity/danger.
And I think you're right, with enough segments we'd never get out of that loop.

2> Allow maxSegments to have only one option: 1. Otherwise we just respect maxMergedSegmentSizeMB.
Users can approximate doing forceMerge with maxSegments with values other than 1 by configuring
their maxMergedSegmentsSizeMB to suit their wishes and do a forceMerge without specifying

So all told this is closer than I feared when I saw how long your response was ;). If we decide
to go with option <1> for maxSegments, all that would remain after that is putting back
the bits with find merges even if there are merges currently running.

> Make TieredMergePolicy respect maxSegmentSizeMB and allow singleton merges of very large
> -------------------------------------------------------------------------------------------------
>                 Key: LUCENE-7976
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch,
LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch
> We're seeing situations "in the wild" where there are very large indexes (on disk) handled
quite easily in a single Lucene index. This is particularly true as features like docValues
move data into MMapDirectory space. The current TMP algorithm allows on the order of 50% deleted
documents as per a dev list conversation with Mike McCandless (and his blog here:
> Especially in the current era of very large indexes in aggregate, (think many TB) solutions
like "you need to distribute your collection over more shards" become very costly. Additionally,
the tempting "optimize" button exacerbates the issue since once you form, say, a 100G segment
(by optimizing/forceMerging) it is not eligible for merging until 97.5G of the docs in it
are deleted (current default 5G max segment size).
> The proposal here would be to add a new parameter to TMP, something like <maxAllowedPctDeletedInBigSegments>
(no, that's not serious name, suggestions welcome) which would default to 100 (or the same
behavior we have now).
> So if I set this parameter to, say, 20%, and the max segment size stays at 5G, the following
would happen when segments were selected for merging:
> > any segment with > 20% deleted documents would be merged or rewritten NO MATTER
HOW LARGE. There are two cases,
> >> the segment has < 5G "live" docs. In that case it would be merged with smaller
segments to bring the resulting segment up to 5G. If no smaller segments exist, it would just
be rewritten
> >> The segment has > 5G "live" docs (the result of a forceMerge or optimize).
It would be rewritten into a single segment removing all deleted docs no matter how big it
is to start. The 100G example above would be rewritten to an 80G segment for instance.
> Of course this would lead to potentially much more I/O which is why the default would
be the same behavior we see now. As it stands now, though, there's no way to recover from
an optimize/forceMerge except to re-index from scratch. We routinely see 200G-300G Lucene
indexes at this point "in the wild" with 10s of  shards replicated 3 or more times. And that
doesn't even include having these over HDFS.
> Alternatives welcome! Something like the above seems minimally invasive. A new merge
policy is certainly an alternative.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message