lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steven Parkes (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
Date Thu, 16 Aug 2007 23:21:31 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520408
] 

Steven Parkes commented on LUCENE-847:
--------------------------------------

	I don't think so: I think if someone changes the merge policy to
	something else, it's fine to require that they then do settings
	directly through that merge policy.

You're going to want to change the default merge policy, right?  So you're going to change
the hard cast in IW to that policy? So it'll fail for anyone that wants to just getMergePolicy
back to the old policy?

If that's the case, I'm going to keep those tests the way they are because when you do change
the policy, I'm assuming you'll keep many of them, just add the manual setMergePolicy(), and
they'll need to have those casts put back in?

Maybe we just put it in MergePolicy interface and let them throw (e.g., via MergePolicyBase)
if called on an unsupported merge policy? That's moving from compile time checking to run
time checking, but ... 

	This is inside addIndexes that we're talking about.

Ah. Right.

	I think we shouldn't allow any mergePolicy to
	leave the index inconsistent (failing to copy over segments from other
	directories).

That makes sense to me. CMP could enforce this, even in the case of concurrent merges.

	No, after another N (= mergeFactor) flushes, it would return a new
	suggested merge.

Okay. I think I'm following you here.

Here's what I understand: in your model, (1) each call to merge will only ever generate one
merge thread (regardless of how many levels might be full) and (2) you can get concurrency
out of this as long as you consider a level "merge worthy" as different from "full", i.e.,
blocking).

You did say  

> > But, we
> > could relax that such that if ever the lowest level has >
> > 2*mergeFactor pending segments to merge then we select the 2nd
> > set.

And I think you'd want to modify that to select the lowest sufficiently over subscribed level,
not just the lowest level if it's oversubscribed?

Perhaps this is sufficient, but not necessary? I see it as simpler just to have the merge
policy (abstractly) generate a set of non-conflicting merges and let someone else worry about
scheduling them.

	But, providing just a single concurrent merge already gains us
	concurrency of merging with adding of docs.

I'm worried about when you start the leftmost merge, that, say, is going to take a day. With
a steady influx of docs, it's not going to be long before you need another merge and if you
have only one thread, you're going to block for the rest of the day. You've bought a little
concurrency, but it's the almost day-long block I really want to avoid.

With a log-like policy, I think it's feasible to have logN threads. You might not want them
all doing disk i/o at the same time: you'd want to prioritize threads on the small merges
and/or suspend large merge threads.  The speed with which the larger merge threads can vary
when other merges are taking place, you just have to not stop them and start over. 

	Right, the LUCENE-845 merge policy doesn't look @ the return result of
	"merge".  It just looks at the newly created SegmentInfos.

Yeah. My thinking was this would be tweaked. If merger.merge returns a valid number of docs,
it could recurse as it does. If merger.merge returned -1 (which CMP does), it would not recurse
but simply continue the loop.

	Hmmmm, in fact, I think your CMP wrapper would not work with the merge
	policy in LUCENE-845, right?  Ie, won't it will just recurse forever?
	So actually I don't see how your CMP (using the current API) can in
	general safely "wrap" around a merge policy w/o breaking things?

I think it's safe, just not concurrent. The recursion would generate the same set of segments
to merge and CMP would make the second call block (abstractly, anyway: it actually throws
an exception that unwinds the stack and causes the call to start again from the top when the
conflicting merge finishes).

	But, if you lock on IndexWriter, what about apps that use multiple
	threads to add documents and but don't use CMP?  When one thread gets
	tied up merging, you'll then block on the other synchronized methods?
	And you also can't flush from other threads either?  I think flushing
	a new segment should be allowed to run concurrently with the merge?

I'm not sure I'm following this. That's what happens now, right? Are you trying to get more
concurrency then there is now w/o using CMP? I certainly haven't been trying to do that.

	I guess I don't
	see the reason to synchronize on IndexWriter instead of segmentInfos.

I looked at trying to make IW work when a synchronization of IW didn't imply a synchronization
of segmentInfos. It's a very, very heavily used little data structure. I found it very hard
to convince myself I could catch all the places locks would be required. And at the same time,
I seemed to be able to do everything I needed with IW locking.

That said, the code's not done, so ....

	Net/net I'm still thinking we should simplify this API to be
	stateless.  I think there are a number of benefits:

 	 * We would no longer need to add a new IndexMerger interface that
	    adds unecessary complexity to Lucnee (and, make the awkward
	    decisions up front on which IndexWriter fields are allowed to be
	    visible through the interface).

	  * Keep CMP simpler (only top of stack (where I think "macro"
	    concurrency should live), not top and bottom).

	  * Work correctly as wrapper around other merge policies (ie not hit
	    infinite recursion because mergePolicy had naturally assumed that
	    "merge" would have changed the segmentInfos)

	  * Allows locking on segmentInfos (not IndexWriter), and allows
 	   concurrency on multiple threads adding docs even without using
 	   CMP.

Hmmm ... I guess our approaches are pretty different. If you want to take a stab at this ...

> Factor merge policy out of IndexWriter
> --------------------------------------
>
>                 Key: LUCENE-847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-847
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, LUCENE-847.patch.txt,
LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, making it
possible for apps to choose a custom merge policy and for easier experimenting with merge
policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message