lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ning Li (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
Date Thu, 30 Aug 2007 22:34:30 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523957
] 

Ning Li commented on LUCENE-847:
--------------------------------

> True, but I was thinking CMPW could be an exception to this rule.  I
> guess I would change the rule to "simple merge policies don't have to
> run their own merges"?

:) Let's see if we have to make that exception.

> Good point...  I think I could refactor this so that cascading logic
> lives entirely in one place IndexWriter.

Another problem of the current cascading in CMPW.MergeThread is, if multiple candidate merges
are found, all of them are added to IndexWriter.mergingSegments. But all but the first should
be removed because only the first merge is carried out (thus removed from mergeSegments after
the merge is done).

How do you make cascading live entirely in IndexWriter? Just removing cascading from CMPW.MergeThread
has one drawback. For example, segment sizes of an index are: 40, 20, 10, buffer size is 10
and merge factor is 2. A buffer full flush of 10 will trigger merge of 10 & 10, then cascade
to 20 & 20, then cascade to 40 & 40. CMPW without cascading will stop after 10 &
10 since IndexWriter.maybeMerge has already returned. Then we have to wait for the next flush
to merge 20 & 20.

> How would this be used?  Ie, how would one make an IndexWriter that
> uses the ConcurrentMerger?  Would we add expert methods
> IndexWriter.set/getIndexMerger(...)?  (And presumably the mergePolicy
> is now owned by IndexMerger so it would have the
> set/getMergePolicy(...)?)
> 
> Also, how would you separate what remains in IW vs what would be in
> IndexMerger?

Maybe Merger does and only does merge (so IndexWriter still owns MergePolicy)? Say, base class
Merger.merge simply calls IndexWriter.merge. ConcurrentMerger.merge creates a merge thread
if possible. Otherwise it calls super.merge, which does non-concurrent merge. IndexWriter
simply calls its merger's merge instead of its own merge. Everything else remains in IndexWriter.


1
> Hmm ... you're right.  This is a separate issue from merge policy,
> right?  Are you proposing buffering deletes in DocumentsWriter
> instead?

Yes, this is a separate issue. And yes if we consider DocumentsWriter as staging area.

2
> Good catch!  How to fix?  One thing we could do is always use
> SegmentInfo.reset(...) and never swap in clones at the SegmentInfo
> level.  This way using the default 'equals' (same instance) would
> work.  Or we could establish identity (equals) of a SegmentInfo as
> checking if the directory plus segment name are equal?  I think I'd
> lean to the 2nd option....

I think the 2nd option is better.

3
> Hmmm yes.  In fact I think we can remove synchronized from optimize
> altogether since within it we are synchronizing(this) at the right
> places?  If more than one thread calls optimize at once, externally,
> it is actually OK: they will each pick a merge that's viable
> (concurrently) and will run the merge, else return once there is no
> more concurrency left.  I'll add a unit test that confirms this.

That seems to be the case. The fact that "the same merge spec will be returned without changes
to segmentInfos" reminds me: MergePolicy.findCandidateMerges finds merges which may not be
eligible; but CMPW checks for eligibility when looking for candidate merges. Maybe we should
unify the behaviour? BTW, MergePolicy.optimize (a rename?) doesn't check for eligibility either.

4
> Well, useCompoundFile(...) is given a single newly flushed segment and
> should decide whether it should be CFS.  Whereas
> useCompoundDocStore(...) is called when doc stores are flushed.  When
> autoCommit=false, segments can share a single set of doc stores, so
> there's no single SegmentInfo to pass down.

The reason I asked is because none of them are used right now. So they might be used in the
future?

> Factor merge policy out of IndexWriter
> --------------------------------------
>
>                 Key: LUCENE-847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-847
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>             Fix For: 2.3
>
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, LUCENE-847.patch.txt,
LUCENE-847.take3.patch, LUCENE-847.take4.patch, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, making it
possible for apps to choose a custom merge policy and for easier experimenting with merge
policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message