lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
Date Thu, 16 Aug 2007 21:11:31 GMT


Michael McCandless commented on LUCENE-847:

> Note that merge() was added not for users (which I have no strong
> opinion about) but so that, potentially, CMP can check again for
> merges when a set of merge threads completes, i.e., cascade.

OK, got it.  In fact, then it seems more important that we NOT flush
at this point?

> > I think instead we should leave the methods, not deprecated, as
> > convenience (sugar) methods. Simple things should be simple;
> > complex things should be possible.
> I think this argues for a LegacyMergePolicy interface again, then?
> If we change the default merge policy and someone changes their code
> to use LogDoc for their own purposes, in both cases the
> getters/setters should work? So cast to the interface and as long as
> the merge policy supports this, the getters/setters work (unless the
> merge policy decides to throw within), otherwise the getters/setters
> throw?

I don't think so: I think if someone changes the merge policy to
something else, it's fine to require that they then do settings
directly through that merge policy.  I don't think we should bring
back the LegacyMergePolicy interface.

> > Uh, no: when someone calls optimize that means it really must be
> > done, right? So "optimize" is the right name I think.
> Yeah, but it might do nothing. Just as merge might do nothing.

Well... that's the exception not the rule.  My vote would be for
"maybeMerge(...)"  and "optimize(..)".

> > Though, this raises the tricky question of index consistency ...
> Definitely. I'm still trying to understand all the subtleties here.
> > IndexWriter commits the new segments file right after
> > mergePolicy.merge returns ... so for CMP we suddenly have an
> > unusable index (as seen by an IndexReader).
> How so? I figured that after mergePolicy.merge returns, in the case
> of CMP with an ongoing merge, segmentInfos won't have changed at
> all. Is that a problem?

This is inside addIndexes that we're talking about.  It will have
changed because the added indexes were stuck into the segmentInfos.
If you commit that segmentInfos, which now references segments in
other directories, the index is inconsistent, until the merge policy
finishes its work (including copying over segments from other dirs).
In fact this used to be an issue but was fixed in LUCENE-702.

> > Maybe it's too ambitious to allow merges of segments from other
> > directories to run concurrently?
> Yeah, that might be the case. At least as a default?

I think it's worse: I think we shouldn't allow any mergePolicy to
leave the index inconsistent (failing to copy over segments from other
directories).  I think it's a bug if the mergePolicy does that and we
should check & raise an exception, and not commit the new segments
file.   IndexWriter should in general protect itself from a mergePolicy
that makes the index inconsistent (and, refuse to commit the resulting
segments file).

With the proposed "stateless API" we would keep calling the
mergePolicy, after each merge, until it returned null, and then do the
check that index is consistent.

> > I would consider it a hard error in IndexWriter if after calling
> > mergePolicy.merge from any of the addIndexes*, there remain
> > segments in other directories. I think we should catch this &
> > throw an exception?
> It would be easy enough for CMP to block in this case, rather than
> returning immediately. Wouldn't that be better? And I suppose it's
> possible to imagine an API on CMP for specifying this behavior?

I think CMP should indeed block in this case.  I wouldn't add an API
to change it.  It's too dangerous to allow an index to become

> > I opened a separate issue LUCENE-982 to track this.
> I think this is good. I think it's an interesting issue but not
> directly related to the refactor?

I think it is related: we should not preclude it in this refactoring.
I think we should fix MergePolicy.optimize to take "int

> > Although ... do you think we need need some way for merge policy
> > to state where the new segment should be inserted into
> > SegmentInfos?
> Right now I assumed it would replace the left most-segment.
> Since I don't really know the details of what such a merge policy
> would like, I don't really know what it needs.
> If you've thought about this more, do you have a suggestion? I
> suppose we could just add an int. But, then again, I'd do that as a
> separate function, leaving the original available, so we can do this
> later, completely compatibly?

I don't have a suggestion :)  And I agree, this is safely postponed while
keeping future backwards compatibility, so, punt!

> > How can the user close IndexWriter and abort the running merges?  I
> > guess CMP would provide a method to abort any running merges, and
> > user would first call that before calling IndexWriter.close?
> I hadn't really thought about this but I can see that should be made
> possible. It's always safe to abandon a merge so it should be
> available, for fast, safe, and clean shutdown.

OK.  Seems like a CMP specific issue (doesn't impact this discussion).

> > True, LogDoc as it now stands would never exploit concurrency (it
> > will always return the highest level that needs merging). But, we
> > could relax that such that if ever the lowest level has >
> > 2*mergeFactor pending segments to merge then we select the 2nd
> > set.
> Okay. But it will always return that? Still doesn't sound
> concurrent?

No, after another N (= mergeFactor) flushes, it would return a new
suggested merge.  I think this gives CMP concurrency to work with.
Also I think other merge policies (eg the rough suggestion in
LUCENE-854) could provide substantial concurrency.

> The thing is, the serial merge policy has no concept of concurrent
> merges, so if the API is always to select the best merge, until a
> pervious merge finishes, it will always return that as the best
> merge.
> Concurrent is going to require, by hook or by crook, that a merge
> policy be able to generate a set of non-conflicting merges, is it
> not?

Correct, if we want more than 1 merge running at once then
mergePolicy must provide non-conflicting merges.

But, providing just a single concurrent merge already gains us
concurrency of merging with adding of docs.  Just that is a great step
forward, and, it's not clear we can expect performance gains by doing
2 merges simultaneously with adding docs.  Have you tested this to

If we think there are still gains there, we can use the idea above, or
apps can use other merge policies (like LUCENE-854) that don't always
choose non-concurrent (conflicting) merges.

> I think the LUCENE-845 merge policy does this now, given that CMP
> gathers up the merge calls. I'm not sure the current LUCENE-847
> merge policy does (I'd have to double check) because it sometimes
> will try to use the result of the current merge in the next
> merge. The LUCENE-845 merge doesn't try to do this which is a(n)
> (inconsequential) change?

Right, the LUCENE-845 merge policy doesn't look @ the return result of
"merge".  It just looks at the newly created SegmentInfos.

Hmmmm, in fact, I think your CMP wrapper would not work with the merge
policy in LUCENE-845, right?  Ie, won't it will just recurse forever?
So actually I don't see how your CMP (using the current API) can in
general safely "wrap" around a merge policy w/o breaking things?

Whereas w/ stateless API, where merge policy just returns what should
be merged rather than executing it itself and cascading, would work

> > This is another benefit of the simplified API:
> > MergePolicy.maybeMerge would only be called with a lock already
> > acquired (by IndexWriter) on the segmentInfos.
> Do you really mean a lock on segmentInfos or just the lock on
> IndexWriter? I'm assuming the latter and I think this is the case
> for both API models.

But, if you lock on IndexWriter, what about apps that use multiple
threads to add documents and but don't use CMP?  When one thread gets
tied up merging, you'll then block on the other synchronized methods?
And you also can't flush from other threads either?  I think flushing
a new segment should be allowed to run concurrently with the merge?

Whereas if you lock only segmentInfos, and use the proposed stateless
API, I think the other threads would not be blocked?  I guess I don't
see the reason to synchronize on IndexWriter instead of segmentInfos.

Net/net I'm still thinking we should simplify this API to be
stateless.  I think there are a number of benefits:

  * We would no longer need to add a new IndexMerger interface that
    adds unecessary complexity to Lucnee (and, make the awkward
    decisions up front on which IndexWriter fields are allowed to be
    visible through the interface).

  * Keep CMP simpler (only top of stack (where I think "macro"
    concurrency should live), not top and bottom).

  * Work correctly as wrapper around other merge policies (ie not hit
    infinite recursion because mergePolicy had naturally assumed that
    "merge" would have changed the segmentInfos)

  * Allows locking on segmentInfos (not IndexWriter), and allows
    concurrency on multiple threads adding docs even without using

> Factor merge policy out of IndexWriter
> --------------------------------------
>                 Key: LUCENE-847
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, LUCENE-847.patch.txt,
> If we factor the merge policy out of IndexWriter, we can make it pluggable, making it
possible for apps to choose a custom merge policy and for easier experimenting with merge
policy variants.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message