lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
Date Fri, 31 Aug 2007 08:32:30 GMT


Michael McCandless commented on LUCENE-847:

> > Good point...  I think I could refactor this so that cascading logic
> > lives entirely in one place IndexWriter.
> Another problem of the current cascading in CMPW.MergeThread is, if
> multiple candidate merges are found, all of them are added to
> IndexWriter.mergingSegments. But all but the first should be removed
> because only the first merge is carried out (thus removed from
> mergeSegments after the merge is done).

You're right -- I'm only doing the first non-conflicting merge in
CMPW (but then not releasing the rest of them).  I think this would be
fixed by having cascading logic only in IndexWriter.

> How do you make cascading live entirely in IndexWriter? Just
> removing cascading from CMPW.MergeThread has one drawback.  For
> example, segment sizes of an index are: 40, 20, 10, buffer size is
> 10 and merge factor is 2. A buffer full flush of 10 will trigger
> merge of 10 & 10, then cascade to 20 & 20, then cascade to 40 &
> 40. CMPW without cascading will stop after 10 & 10 since
> IndexWriter.maybeMerge has already returned. Then we have to wait
> for the next flush to merge 20 & 20.

Oh, I would remove from CMPW and add then add it into IndexWriter (so
the scenario above would cascade normally).  Meaning, IndexWriter,
upon completing a merge, would always consult the policy for whether
the completed merge has now enabled any new merges.

This is somewhat messy though (with CMPW as a MergePolicy) because
then findCandidateMerges would need to know if it was being called
(due to cascading) under one of its own threads and if so act
differently.  Another good reason to make it a separate Merger

> > How would this be used?  Ie, how would one make an IndexWriter
> > that uses the ConcurrentMerger?  Would we add expert methods
> > IndexWriter.set/getIndexMerger(...)?  (And presumably the
> > mergePolicy is now owned by IndexMerger so it would have the
> > set/getMergePolicy(...)?)
> > 
> > Also, how would you separate what remains in IW vs what would be
> > in IndexMerger?
> Maybe Merger does and only does merge (so IndexWriter still owns
> MergePolicy)? Say, base class Merger.merge simply calls
> IndexWriter.merge. ConcurrentMerger.merge creates a merge thread if
> possible. Otherwise it calls super.merge, which does non-concurrent
> merge. IndexWriter simply calls its merger's merge instead of its
> own merge. Everything else remains in IndexWriter.

OK I will test out this approach.

> > Hmm ... you're right.  This is a separate issue from merge policy,
> > right?  Are you proposing buffering deletes in DocumentsWriter
> > instead?
> Yes, this is a separate issue. And yes if we consider
>  DocumentsWriter as staging area.

I will open new issue.

> > Good catch!  How to fix?  One thing we could do is always use
> > SegmentInfo.reset(...) and never swap in clones at the SegmentInfo
> > level.  This way using the default 'equals' (same instance) would
> > work.  Or we could establish identity (equals) of a SegmentInfo as
> > checking if the directory plus segment name are equal?  I think
> > I'd lean to the 2nd option....
> I think the 2nd option is better.

I'll take this approach.

> > Hmmm yes.  In fact I think we can remove synchronized from
> > optimize altogether since within it we are synchronizing(this) at
> > the right places?  If more than one thread calls optimize at once,
> > externally, it is actually OK: they will each pick a merge that's
> > viable (concurrently) and will run the merge, else return once
> > there is no more concurrency left.  I'll add a unit test that
> > confirms this.
> That seems to be the case.

I'll add unit test to confirm.

> The fact that "the same merge spec will be returned without changes
> to segmentInfos" reminds me: MergePolicy.findCandidateMerges finds
> merges which may not be eligible; but CMPW checks for eligibility
> when looking for candidate merges. Maybe we should unify the
> behaviour?

Not quite following you here... not being eligible because the merge
is in-progress in a thread is something I think any given MergePolicy
should not have to track?  Once I factor out CMPW as its own Merger
subclass I think the eligibility check happens only in IndexWriter?

> BTW, MergePolicy.optimize (a rename?) doesn't check for eligibility
> either.

Rename to/from what?  (It is currently called MergePolicy.optimize).
IndexWriter steps through the merges and only runs the ones that do
not conflict (are eligible)?

> > Well, useCompoundFile(...) is given a single newly flushed segment
> > and should decide whether it should be CFS.  Whereas
> > useCompoundDocStore(...) is called when doc stores are flushed.
> > When autoCommit=false, segments can share a single set of doc
> > stores, so there's no single SegmentInfo to pass down.
> The reason I asked is because none of them are used right now. So
> they might be used in the future?

Both of these methods are now called by IndexWriter (in the patch),
upon flushing a new segment.

> Factor merge policy out of IndexWriter
> --------------------------------------
>                 Key: LUCENE-847
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>             Fix For: 2.3
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, LUCENE-847.patch.txt,
LUCENE-847.take3.patch, LUCENE-847.take4.patch, LUCENE-847.txt
> If we factor the merge policy out of IndexWriter, we can make it pluggable, making it
possible for apps to choose a custom merge policy and for easier experimenting with merge
policy variants.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message