lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize
Date Fri, 11 Feb 2011 15:35:57 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-854:
--------------------------------------

    Attachment: LUCENE-854.patch

I created a new merge policy to take advantage of non-contiguous merging (LUCENE-1076) and
fix certain limitations of LogMergePolicy.

The new policy does not support contiguous merging; it always merges by byte size, pro-rated
by the percentage of deleted docs.
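
The pro-rating isn't spelled out above; as a hedged sketch (formula and names
assumed, not lifted from the patch), it amounts to discounting a segment's byte
size by its fraction of deleted docs:

    // Hedged sketch (formula assumed, not from the patch): a segment that is
    // half deletions competes for merging as if it were half its byte size.
    public class ProRatedSize {
      static long proRatedBytes(long sizeInBytes, int delCount, int maxDoc) {
        double delRatio = maxDoc <= 0 ? 0.0 : (double) delCount / maxDoc;
        return (long) (sizeInBytes * (1.0 - delRatio));
      }

      public static void main(String[] args) {
        // A 1 GB segment with 25% deletions counts as ~768 MB.
        System.out.println(proRatedBytes(1L << 30, 250000, 1000000));
      }
    }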

The policy's core logic is similar to LogMP, in that it tries to merge roughly equal-sized
segments, maxMergeAtOnce (renamed from mergeFactor) at a time, resulting in the usual
exponential staircase pattern when you feed it roughly equal-sized segments.
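
As a toy illustration of that staircase (demo code, not from the patch): with
maxMergeAtOnce=10 and ~1 MB flushed segments, each level's segments are 10X
the size of the previous level's:

    // Toy illustration: merging maxMergeAtOnce roughly equal-sized segments
    // per level multiplies the per-segment size by maxMergeAtOnce each step.
    public class StaircaseDemo {
      public static void main(String[] args) {
        final int maxMergeAtOnce = 10;
        long levelBytes = 1L << 20;  // assume ~1 MB flushed (level 0) segments
        for (int level = 0; level <= 3; level++) {
          System.out.println("level " + level + ": ~" + (levelBytes >> 20) + " MB per segment");
          levelBytes *= maxMergeAtOnce;
        }
      }
    }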

You configure the approx max merged segment size (unlike LogMP, where you configure the max
to-be-merged size, which was always a source of confusion!).  Unlike LogMP, when segments
are getting close to being too large, the new policy will merge fewer segments, e.g. down to
merging pairwise, to reach approx the max allowed size.  This is important since it makes
that setting more "accurate"; I now default it to 5 GB (vs LogMP's 2 GB).

There is a separate maxMergeAtOnceExplicit that controls "explicit" merging (i.e., the app
calls optimize or expungeDeletes, and maybe in the future also addIndexes); I defaulted it
to 30.  There is no max segment size for optimize.

The big difference vs LogMP is that the new policy does not "over-merge": it does not "pay
it forward" by forcefully cascading the way LogMP does today.  This fixes the "inadvertent
optimize" that LogMP does.

For any given index size, the new policy computes a budget of how many segments the index
is allowed to have (i.e., it enumerates the steps in the staircase, based on maxMergeAtOnce,
the [floored] min segment size, and total bytes in the index); then, if the index is
over-budget, it picks the least-cost merge.  This results in a smoother progression of the
segment count over time.
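
A rough sketch of that budget computation (illustrative, not the patch
verbatim; segsPerTier here is the per-level tolerance described just below):

    // Illustrative: walk the staircase up from the floored minimum segment
    // size, tolerating segsPerTier segments per level, until the index's
    // total bytes are consumed; the sum is the allowed segment count.
    public class SegmentBudget {
      static double allowedSegmentCount(long totalIndexBytes, long minSegmentBytes,
                                        long floorSegmentBytes, double segsPerTier,
                                        int maxMergeAtOnce) {
        long levelSize = Math.max(minSegmentBytes, floorSegmentBytes);
        long bytesLeft = totalIndexBytes;
        double allowed = 0;
        while (true) {
          double segCountLevel = bytesLeft / (double) levelSize;
          if (segCountLevel < segsPerTier) {
            allowed += Math.ceil(segCountLevel);
            break;
          }
          allowed += segsPerTier;
          bytesLeft -= (long) (segsPerTier * levelSize);
          levelSize *= maxMergeAtOnce;
        }
        return allowed;  // over this budget -> run the least-cost merge
      }
    }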

There is a new configuration, segmentsPerTier, that lets you control how many segments per
level you can "tolerate".  This is a nice knob to turn to trade off merge cost vs search
cost.  It defaults to 10, which matches the staircase pattern that LogMP produces, but you
can now control the "width" of the stairs in the staircase separately from how many segments
are merged at once for non-explicit merges.

It has useCompoundFile and noCFSRatio just like LogMP.

It has a new setting, expungeDeletesPctAllowed (default 10%), which allows expungeDeletes
to skip merging a segment if less than 10% of its docs are deleted.
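
Putting the knobs together, a hedged configuration sketch: it assumes the new
policy is exposed as TieredMergePolicy with setters named after the options
described above, which the attached patch may or may not match exactly:

    // Hedged sketch: class and setter names are assumptions based on the
    // option names above; see the attached patch for the actual API.
    import org.apache.lucene.index.TieredMergePolicy;

    public class ConfigureNewPolicy {
      public static void main(String[] args) {
        TieredMergePolicy mp = new TieredMergePolicy();
        mp.setMaxMergeAtOnce(10);             // normal merge width (was mergeFactor)
        mp.setMaxMergeAtOnceExplicit(30);     // width for optimize/expungeDeletes
        mp.setMaxMergedSegmentMB(5 * 1024);   // approx max merged segment, ~5 GB
        mp.setSegmentsPerTier(10.0);          // tolerated segments per level
        mp.setExpungeDeletesPctAllowed(10.0); // expungeDeletes skips segs under 10% deletes
        mp.setUseCompoundFile(true);
        mp.setNoCFSRatio(0.1);
        // then: indexWriterConfig.setMergePolicy(mp);
      }
    }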

I think we should keep LogMergePolicy available for apps that want contiguous merging,
merging by doc count, no pro-rating by deletions, or a max segment size enforced during
optimize.  But, with this, I'd remove the non-contiguous support that was added to
LogMergePolicy under LUCENE-1076, and make this new MP the default one.


> Create merge policy that doesn't periodically inadvertently optimize
> --------------------------------------------------------------------
>
>                 Key: LUCENE-854
>                 URL: https://issues.apache.org/jira/browse/LUCENE-854
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-854.patch
>
>
> The current merge policy, at every maxBufferedDocs *
> power-of-mergeFactor docs added, will do a fully cascaded merge, which
> is the same as an optimize.
> I think this is not good because at that "optimization point", the
> particular addDocument call is [surprisingly] very expensive.  While,
> amortized over all addDocument calls, the cost is low, the cost is
> paid "up front" and in a very "bunched up" manner.
> I think of this as "pay it forward": you are paying the full cost of
> an optimize right now on the expectation / hope that you will be
> adding a great many more docs.  But, if you don't add that many more
> docs, then, the amortized cost for your index is in fact far higher
> than it should have been.  Better to "pay as you go" instead.
> So we could make a small change to the policy by only merging the
> first mergeFactor segments once we hit 2X the merge factor.  With
> mergeFactor=10, when we have created the 20th level 0 (just flushed)
> segment, we merge the first 10 into a level 1 segment.  Then on
> creating another 10 level 0 segments, we merge the second set of 10
> level 0 segments into a level 1 segment, etc.
> With this new merge policy, an index that's a bit bigger than a
> current "optimization point" would then have a lower amortized cost
> per document.  Plus the merge cost is less "bunched up" and less "pay
> it forward": instead you pay for what you are actually using.
> We can start by creating this merge policy (probably combined with
> the "by size not by doc count" segment level computation from
> LUCENE-845) and then later decide whether we should make it the
> default merge policy.
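
For reference, a toy sketch of the 2X-mergeFactor idea from the quoted
description (demo code, not a real MergePolicy implementation):

    // Toy sketch of the quoted proposal: wait until 2*mergeFactor level-0
    // segments exist, then merge only the first mergeFactor of them into a
    // single level-1 segment, instead of cascading the whole index.
    import java.util.ArrayList;
    import java.util.List;

    public class DeferredMergeDemo {
      static void maybeMerge(List<Long> level0, List<Long> level1, int mergeFactor) {
        if (level0.size() >= 2 * mergeFactor) {
          long merged = 0;
          for (int i = 0; i < mergeFactor; i++) {
            merged += level0.remove(0);
          }
          level1.add(merged);
        }
      }

      public static void main(String[] args) {
        List<Long> level0 = new ArrayList<Long>();
        List<Long> level1 = new ArrayList<Long>();
        for (int i = 0; i < 20; i++) {
          level0.add(1L);  // flush one more level-0 segment
          maybeMerge(level0, level1, 10);
        }
        // On the 20th flush: 10 level-0 segments remain plus 1 level-1 segment.
        System.out.println(level0.size() + " level-0, " + level1.size() + " level-1");
      }
    }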

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
