lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
Date Thu, 30 Aug 2007 09:10:30 GMT


Michael McCandless commented on LUCENE-847:

Thanks for the detailed review Ning!

> 1 As you pointed out, "the merge policy is no longer responsible for
> running the merges itself". MergePolicy.maybeMerge simply returns a
> merge specification. But ConcurrentMergePolicyWrapper.maybeMerge
> actually starts concurrent merge threads thus doing the merges.

True, but I was thinking CMPW could be an exception to this rule.  I
guess I would change the rule to "simple merge policies don't have to
run their own merges"?

However I do agree, CMPW is clearly a different beast from a typical
merge policy because it entails scheduling, not selection, of merges.

> 2 Related to 1, cascading is done in IndexWriter in non-concurrent
> case. But in concurrent case, cascading is also done in merge
> threads which are started by
> ConcurrentMergePolicyWrapper.maybeMerge.

Good point...  I think I could refactor this so that cascading logic
lives entirely in one place IndexWriter.

> MergePolicy.maybeMerge should continue to simply return a merge
> specification. (BTW, should we rename this maybeMerge to, say,
> findCandidateMerges?)

Good!  I like findCandidateMerges better; I'll change it.

> Can we carve the merge process out of IndexWriter into a Merger?
> IndexWriter still provides the building blocks - merge(OneMerge),
> mergeInit(OneMerge), etc. Merger uses these building blocks. A
> ConcurrentMerger extends Merger but starts concurrent merge threads
> as ConcurrentMergePolicyWrapper does.

How would this be used?  Ie, how would one make an IndexWriter that
uses the ConcurrentMerger?  Would we add expert methods
IndexWriter.set/getIndexMerger(...)?  (And presumably the mergePolicy
is now owned by IndexMerger so it would have the

Also, how would you separate what remains in IW vs what would be in

I like this approach in principle; I'm just trying to hash out the

> 1 UpdateDocument's and deleteDocument's bufferDeleteTerm are
> synchronized on different variables in this patch.

Woops, good catch!  I'll fix.

> However, the semantics of updateDocument changed since
> LUCENE-843. Before LUCENE-843, updateDocument, which is a delete and
> an insert, guaranteed the delete and the insert are committed
> together (thus an update). Now it's possible that they are committed
> in different transactions. If we consider DocumentsWriter as the RAM
> staging area for IndexWriter, then deletes are also buffered in RAM
> staging area and we can restore our previous semantics, right?

Hmm ... you're right.  This is a separate issue from merge policy,
right?  Are you proposing buffering deletes in DocumentsWriter

> 2 OneMerge.segments seems to rely on its segment infos' reference to
> segment infos of IndexWriter.segmentInfos. The use in commitMerge,
> which calls ensureContiguousMerge, is an example. However,
> segmentInfos can be a cloned copy because of exceptions, thus the
> reference broken.

Good catch!  How to fix?  One thing we could do is always use
SegmentInfo.reset(...) and never swap in clones at the SegmentInfo
level.  This way using the default 'equals' (same instance) would
work.  Or we could establish identity (equals) of a SegmentInfo as
checking if the directory plus segment name are equal?  I think I'd
lean to the 2nd option....

> 3 Calling optimize of an IndexWriter with the current
> ConcurrentMergePolicyWrapper may cause deadlock: the one merge spec
> returned by MergePolicy.optimize may be in conflict with a
> concurrent merge (the same merge spec will be returned without
> changes to segmentInfos), but a concurrent merge cannot finish
> because optimize is holding the lock.

Hmmm yes.  In fact I think we can remove synchronized from optimize
altogether since within it we are synchronizing(this) at the right
places?  If more than one thread calls optimize at once, externally,
it is actually OK: they will each pick a merge that's viable
(concurrently) and will run the merge, else return once there is no
more concurrency left.  I'll add a unit test that confirms this.

> 4 Finally, a couple of minor things:
>   1 LogMergePolicy.useCompoundFile(SegmentInfos infos, SegmentInfo
>     info) and useCompoundDocStore(SegmentInfos infos): why the
>     parameters?

Well, useCompoundFile(...) is given a single newly flushed segment and
should decide whether it should be CFS.  Whereas
useCompoundDocStore(...) is called when doc stores are flushed.  When
autoCommit=false, segments can share a single set of doc stores, so
there's no single SegmentInfo to pass down.

> 2 Do we need doMergeClose in IndexWriter? Can we simply close a
>   MergePolicy if not null?

Good point.  I think this is reasonable -- I'll fix.

> Factor merge policy out of IndexWriter
> --------------------------------------
>                 Key: LUCENE-847
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>             Fix For: 2.3
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, LUCENE-847.patch.txt,
LUCENE-847.take3.patch, LUCENE-847.take4.patch, LUCENE-847.txt
> If we factor the merge policy out of IndexWriter, we can make it pluggable, making it
possible for apps to choose a custom merge policy and for easier experimenting with merge
policy variants.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message