lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steven Parkes (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
Date Thu, 16 Aug 2007 16:27:31 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520293
] 

Steven Parkes commented on LUCENE-847:
--------------------------------------

	One new small item: you've added a "public void merge()" to
	IndexWriter so that people can externally kick off a merge request,
	which is good I think.

	But, is it really necessary to flush here?  It would be better to not
	flush so that users then have two separate methods (flush() and
	merge()) to do each function independently.

That makes sense.

Note that merge() was added not for users (which I have no strong opinion about) but so that,
potentially, CMP can check again for merges when a set of merge threads completes, i.e., cascade.

	I think instead we should leave the methods, not deprecated, as
	convenience (sugar) methods.  Simple things should be simple; complex
	things should be possible.

I think this argues for a LegacyMergePolicy interface again, then? If we change the default
merge policy and someone changes their code to use LogDoc for their own purposes, in both
cases the getters/setters should work? So cast to the interface and as long as the merge policy
supports this, the getters/setters work (unless the merge policy decides to throw within),
otherwise the getters/setters throw? 

	Uh, no: when someone calls optimize that means it really must be done,
	right?  So "optimize" is the right name I think.

Yeah, but it might do nothing. Just as merge might do nothing.

	Can you factor this out, eg add a private method
	"getLogDocMergePolicy(String reason)"

Sure.

	Looks good, thanks.  Can you add javadocs (w/ params) for both of
	these new methods?

Sure.

	Though, this raises the tricky question of index
	consistency ...

Definitely. I'm still trying to understand all the subtleties here.

	IndexWriter commits the new segments file right after
	mergePolicy.merge returns ... so for CMP we suddenly have an unusable
	index (as seen by an IndexReader).

How so? I figured that after mergePolicy.merge returns, in the case of CMP with an ongoing
merge, segmentInfos won't have changed at all. Is that a problem?

I thought the issue would be on the other end, where the concurrent merge finishes and needs
to update segmentInfos.

	Maybe it's too ambitious to allow merges of segments from other
	directories to run concurrently?

Yeah, that might be the case. At least as a default?

	I would consider it a hard error in IndexWriter if after calling
	mergePolicy.merge from any of the addIndexes*, there remain segments
	in other directories.  I think we should catch this & throw an
	exception?

It would be easy enough for CMP to block in this case, rather than returning immediately.
Wouldn't that be better? And I suppose it's possible to imagine an API on CMP for specifying
this behavior?

	I opened a separate issue LUCENE-982 to track this.

I think this is good. I think it's an interesting issue but not directly related to the refactor?

	Although ... do you think we need need some way for merge policy to
	state where the new segment should be inserted into SegmentInfos?

Right now I assumed it would replace the left most-segment.

Since I don't really know the details of what such a merge policy would like, I don't really
know what it needs.

If you've thought about this more, do you have a suggestion? I suppose we could just add an
int. But, then again, I'd do that as a separate function, leaving the original available,
so we can do this later, completely compatibly?

	Hmmm, does CMP block on close while it joins to any running merge
	threads?

Yeah, at least in my sandbox.

	How can the user close IndexWriter and abort the running
	merges?  I guess CMP would provide a method to abort any running
	merges, and user would first call that before calling
	IndexWriter.close?

I hadn't really thought about this but I can see that should be made possible. It's always
safe to abandon a merge so it should be available, for fast, safe, and clean shutdown.

	True, LogDoc as it now stands would never exploit concurrency (it will
	always return the highest level that needs merging).  But, we could
	relax that such that if ever the lowest level has > 2*mergeFactor
	pending segments to merge then we select the 2nd set.

Okay. But it will always return that? Still doesn't sound concurrent?

The thing is, the serial merge policy has no concept of concurrent merges, so if the API is
always to select the best merge, until a pervious merge finishes, it will always return that
as the best merge.

Concurrent is going to require, by hook or by crook, that a merge policy be able to generate
a set of non-conflicting merges, is it not?

I think the LUCENE-845 merge policy does this now, given that CMP gathers up the merge calls.
I'm not sure the current LUCENE-847 merge policy does (I'd have to double check) because it
sometimes will try to use the result of the current merge in the next merge. The LUCENE-845
merge doesn't try to do this which is a(n) (inconsequential) change?

	This is another benefit of the simplified API: MergePolicy.maybeMerge
	would only be called with a lock already acquired (by IndexWriter) on
	the segmentInfos.

Do you really mean a lock on segmentInfos or just the lock on IndexWriter? I'm assuming the
latter and I think this is the case for both API models.

I don't think it's feasible to have a lock on segmentInfos separately. Only IndexWriter should
change segmentInfos and no code should try to look at segmentInfos w/o being called via an
IW synch method.

This does imply that CMP has to copy any segmentInfos data it plans to use during concurrent
merging, since the IW lock is not held during these periods. Then, when the merge is done,
segmentInfos is updated in IndexWriter via a synch call to IW#replace.

This means IW#segmentInfos can change while a merge is in progress and this has to be accounted
for. That's what I'm walking through now.


> Factor merge policy out of IndexWriter
> --------------------------------------
>
>                 Key: LUCENE-847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-847
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, LUCENE-847.patch.txt,
LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, making it
possible for apps to choose a custom merge policy and for easier experimenting with merge
policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message