lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
Date Sat, 18 Aug 2007 18:02:31 GMT


Michael McCandless commented on LUCENE-847:

> > I don't think so: I think if someone changes the merge policy to
> > something else, it's fine to require that they then do settings
> > directly through that merge policy.
> You're going to want to change the default merge policy, right?  So
> you're going to change the hard cast in IW to that policy? So it'll
> fail for anyone that wants to just getMergePolicy back to the old
> policy?

I don't really follow... my feeling is we should not deprecate
setUseCompoundFile, setMergeFactor, setMaxMergeDocs.

> > I think we shouldn't allow any mergePolicy to leave the index
> > inconsistent (failing to copy over segments from other
> > directories).
> That makes sense to me. CMP could enforce this, even in the case of
> concurrent merges.

I think IndexWriter should enforce it?  Ie no merge policy should be
allowed to leave segments in other dirs (= an inconsistent index) at
the point of commit.

> Perhaps this is sufficient, but not necessary? I see it as simpler
> just to have the merge policy (abstractly) generate a set of
> non-conflicting merges and let someone else worry about scheduling
> them.

I like that idea :)  It fits well w/ the stateless API.  Ie, merge
policy returns all possible merges and "someone above" takes care of
scheduling them.
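
To make the stateless idea concrete, here is a minimal sketch.  All
names in it (StatelessMergeSketch, MergeSpec, MergePolicy,
EqualSizePolicy, applyAll) are hypothetical and are not taken from any
LUCENE-847 patch, and the toy "equal doc count" rule merely stands in
for a real policy.  The point is only that the policy proposes
non-conflicting merges and never runs them itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types come from the LUCENE-847 patches.
final class StatelessMergeSketch {

    /** A proposed merge of the adjacent segments [start, end). */
    static final class MergeSpec {
        final int start;
        final int end;
        MergeSpec(int start, int end) { this.start = start; this.end = end; }
    }

    /** Stateless policy: inspects segment doc counts and returns proposals only. */
    interface MergePolicy {
        List<MergeSpec> findMerges(List<Integer> segmentDocCounts);
    }

    /** Toy rule: propose each non-overlapping run of mergeFactor equal-sized segments. */
    static final class EqualSizePolicy implements MergePolicy {
        private final int mergeFactor;
        EqualSizePolicy(int mergeFactor) { this.mergeFactor = mergeFactor; }

        @Override
        public List<MergeSpec> findMerges(List<Integer> sizes) {
            List<MergeSpec> merges = new ArrayList<>();
            int i = 0;
            while (i + mergeFactor <= sizes.size()) {
                boolean sameSize = true;
                for (int j = i + 1; j < i + mergeFactor; j++) {
                    if (!sizes.get(j).equals(sizes.get(i))) { sameSize = false; break; }
                }
                if (sameSize) {
                    merges.add(new MergeSpec(i, i + mergeFactor));
                    i += mergeFactor;   // skip past this run so proposals never overlap
                } else {
                    i++;
                }
            }
            return merges;
        }
    }

    /** "Someone above" decides how to run the proposals; here, serially. */
    static List<Integer> applyAll(List<Integer> sizes, List<MergeSpec> merges) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        for (MergeSpec m : merges) {    // merges are sorted and non-overlapping
            while (i < m.start) out.add(sizes.get(i++));
            int sum = 0;
            while (i < m.end) sum += sizes.get(i++);
            out.add(sum);
        }
        while (i < sizes.size()) out.add(sizes.get(i++));
        return out;
    }
}
```

Because the proposals do not overlap, a caller could just as well
hand each MergeSpec to its own thread instead of running them one
after another.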

> > But, providing just a single concurrent merge already gains us
> > concurrency of merging with adding of docs.
> I'm worried about when you start the leftmost merge, that, say, is
> going to take a day. With a steady influx of docs, it's not going to
> be long before you need another merge and if you have only one
> thread, you're going to block for the rest of the day. You've bought
> a little concurrency, but it's the almost day-long block I really
> want to avoid.

Ahh ... very good point.  I agree.

> With a log-like policy, I think it's feasible to have logN
> threads. You might not want them all doing disk i/o at the same
> time: you'd want to prioritize threads on the small merges and/or
> suspend large merge threads.  The speed of the larger merge threads
> can vary when other merges are taking place; you just have to not
> stop them and start over.

Agreed: CMP should do this.
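
One possible way to favor the small merges, sketched under
assumptions: pending merges sit in a priority queue ordered by total
doc count, so the cheapest one is always picked next.  The names
(SmallMergesFirstSketch, PendingMerge, runOrder) are hypothetical,
and suspending a running large merge is not shown.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of "small merges first"; not code from CMP.
final class SmallMergesFirstSketch {

    static final class PendingMerge {
        final String name;
        final long totalDocs;
        PendingMerge(String name, long totalDocs) {
            this.name = name;
            this.totalDocs = totalDocs;
        }
    }

    /** Returns the order in which pending merges would be started. */
    static List<String> runOrder(List<PendingMerge> pending) {
        // Smallest total doc count has the highest priority.
        PriorityQueue<PendingMerge> queue =
            new PriorityQueue<>(Comparator.comparingLong((PendingMerge m) -> m.totalDocs));
        queue.addAll(pending);
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll().name);
        }
        return order;
    }
}
```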

> > Right, the LUCENE-845 merge policy doesn't look @ the return
> > result of "merge".  It just looks at the newly created
> > SegmentInfos.
> Yeah. My thinking was this would be tweaked. If merger.merge returns
> a valid number of docs, it could recurse as it does. If merger.merge
> returned -1 (which CMP does), it would not recurse but simply
> continue the loop.

Hmm.  This means each merge policy must know whether it's talking to
CMP or IndexWriter underneath?  With the stateless approach this
wouldn't happen.

> > Hmmmm, in fact, I think your CMP wrapper would not work with the
> > merge policy in LUCENE-845, right?  Ie, won't it just recurse
> > forever?  So actually I don't see how your CMP (using the current
> > API) can in general safely "wrap" around a merge policy w/o
> > breaking things?
> I think it's safe, just not concurrent. The recursion would generate
> the same set of segments to merge and CMP would make the second call
> block (abstractly, anyway: it actually throws an exception that
> unwinds the stack and causes the call to start again from the top
> when the conflicting merge finishes).

Oh I see...  that's kind of sneaky (planning on using exceptions to
abort a merge requested by the policy).  I think the stateless
approach would be cleaner here.

> > But, if you lock on IndexWriter, what about apps that use multiple
> > threads to add documents but don't use CMP?  When one thread
> > gets tied up merging, you'll then block on the other synchronized
> > methods?  And you also can't flush from other threads either?  I
> > think flushing a new segment should be allowed to run concurrently
> > with the merge?
> I'm not sure I'm following this. That's what happens now, right? Are
> you trying to get more concurrency than there is now w/o using CMP?
> I certainly haven't been trying to do that.

True, this is something new.  But since you're already doing the work
to allow a merge to run in the BG without blocking adding of docs,
flushing, etc, wouldn't this come nearly for free?  Actually I think
all that's necessary, regardless of sync'ing on IndexWriter or
SegmentInfos, is to move the "if (triggerMerge)" out of the
synchronized method/block.
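
Roughly what I have in mind, as a sketch only.  The class and its
members (FlushThenMergeSketch, maybeMerge, mergeInProgress, snapshot)
are hypothetical, segments are just strings, and string concatenation
stands in for the slow merge.  The decision to merge is made under
the lock, but the merge itself runs after the lock is released.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not IndexWriter code.
final class FlushThenMergeSketch {
    private final List<String> segmentInfos = new ArrayList<>();
    private final int mergeFactor;
    private boolean mergeInProgress = false;

    FlushThenMergeSketch(int mergeFactor) { this.mergeFactor = mergeFactor; }

    void flush(String newSegment) {
        boolean triggerMerge;
        synchronized (this) {
            segmentInfos.add(newSegment);
            triggerMerge = segmentInfos.size() >= mergeFactor;
        }
        // Outside the lock: other threads can keep flushing while this merge runs.
        if (triggerMerge) {
            maybeMerge();
        }
    }

    private void maybeMerge() {
        List<String> toMerge;
        synchronized (this) {
            // Another thread may already be merging, or may have merged these segments.
            if (mergeInProgress || segmentInfos.size() < mergeFactor) return;
            toMerge = new ArrayList<>(segmentInfos.subList(0, mergeFactor));
            mergeInProgress = true;
        }
        String merged = String.join("+", toMerge);  // the slow part; no lock held
        synchronized (this) {
            segmentInfos.removeAll(toMerge);        // assumes segment names are unique
            segmentInfos.add(0, merged);
            mergeInProgress = false;
        }
    }

    synchronized List<String> snapshot() { return new ArrayList<>(segmentInfos); }
}
```

Segments flushed by other threads while the merge is running are
appended after the ones being merged, so the final swap leaves them
untouched.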

> > I guess I don't see the reason to synchronize on IndexWriter
> > instead of segmentInfos.
> I looked at trying to make IW work when a synchronization of IW
> didn't imply a synchronization of segmentInfos. It's a very, very
> heavily used little data structure. I found it very hard to convince
> myself I could catch all the places locks would be required. And at
> the same time, I seemed to be able to do everything I needed with IW
> locking.

Well, eg flush() now synchronizes on IndexWriter: we don't want 2
threads doing this at once.  But, the touching of segmentInfos inside
flush (to add the new SegmentInfo) is a tiny fleeting event (like
replace) and so you would want segmentInfos to be free to change while
the flushing was running (eg by a BG merge that has finished).
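
A sketch of that two-lock arrangement, with hypothetical names
(TwoLockFlushSketch, flushLock, commitMerge, snapshot) and a Runnable
standing in for writing the segment files.  One lock serializes
flushes; the segmentInfos monitor is held only for the brief add or
replace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not IndexWriter code.
final class TwoLockFlushSketch {
    private final Object flushLock = new Object();               // serializes flushes
    private final List<String> segmentInfos = new ArrayList<>(); // guarded by its own monitor

    void flush(String name, Runnable writeSegmentFiles) {
        synchronized (flushLock) {
            writeSegmentFiles.run();        // slow part; segmentInfos is not locked here
            synchronized (segmentInfos) {   // the tiny fleeting event
                segmentInfos.add(name);
            }
        }
    }

    /** Called when a background merge finishes: swap the merged segments for the result. */
    void commitMerge(List<String> merged, String result) {
        synchronized (segmentInfos) {
            int at = segmentInfos.indexOf(merged.get(0));
            if (at < 0) {
                throw new IllegalArgumentException("unknown segment: " + merged.get(0));
            }
            segmentInfos.removeAll(merged);
            segmentInfos.add(at, result);
        }
    }

    List<String> snapshot() {
        synchronized (segmentInfos) {
            return new ArrayList<>(segmentInfos);
        }
    }
}
```

A background merge thread can call commitMerge while another thread
is in the slow part of flush, because only flushLock is held there.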

> Hmmm ... I guess our approaches are pretty different. If you want to
> take a stab at this ...

OK I will try to take a rough stab at the stateless approach....

> Factor merge policy out of IndexWriter
> --------------------------------------
>                 Key: LUCENE-847
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, LUCENE-847.patch.txt,
> If we factor the merge policy out of IndexWriter, we can make it
> pluggable, making it possible for apps to choose a custom merge
> policy and for easier experimenting with merge policy variants.
