lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steven Parkes (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
Date Tue, 14 Aug 2007 21:49:31 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12519790
] 

Steven Parkes commented on LUCENE-847:
--------------------------------------

	Are you going to fix all unit tests that call the now-deprecated APIs?

Yeah. Thanks for the reminder.

As to the IndexWriter vs. IndexMerger issue, I still think having the interface is useful
if not only that it makes my testing much easier. I have a MockIndexMerger that implements
only the functions in the interface and therefore I can test merge policies without creating
a writer. For me this has been a big win ...

	Exactly: these settings decide when a segment is flushed, so, why put
	them into IndexMerger interface?  They shouldn't have anything to with
	merging; I think they should be removed.

	For LUCENE-845 I'm working on a replacement for LogDocMergePolicy that
	does not use maxBufferedDocs.

I can see that one could write a merge policy that didn't have any idea of how the initial
buffering was done, but I worry about precluding it. Maybe the LUCENE-845 patch will show
a strong enough pattern to believe no merge policies will need it?

	I think factoring into a base class is an OK solution, but, it
	shouldn't be MergePolicy's job to remember to call this final "move
	any segments in the wrong directory over" code.  As long as its in one
	place and people don't have to copy/paste code between MergePolicy
	sources.

In the case of concurrent merges, I think this gets more complicated. When do you do those
directory copies? I think you can't do them at the return from the merge policy because the
merge policy may want to do them, but later.

I don't think IndexWriter has enough information to know when the copies need to done. Doubly
so if we have concurrent merges?

I still stand by it should be the merge policy making the choice. You could have the code
in IndexWriter too, but then there'd be duplicate code. To put the code only in IndexWriter
removes the choice from the merge policy.

	I think there
	should in fact be a default optimize() in the base class that does
	what current IndexWriter now does so that a MergePolicy need not
	implement optimize at all.

It'd be nice, but I don't know how to do it: the merge factor is not generic, so I don't know
how to implement the loop generically.

Ah ... I see: with your forced merge ... hmmm.

	No, forced would mean the merge policy must do a merge; whereas,
	normally, it's free not to do a merge until it wants to.

Hmmmm ...

I think adding a forced merge concept here is new ... If it's simply to support optimize,
I'm not sure I find it too compelling. LogDoc as it stands uses different algorithms for incremental
merges and optimize, so there's not too much of a concept of forced merges vs. optional merges
to be factored out. So I guess I'm not seeing a strong compelling case for creating it?

	Well, it's sort of awkward if you want to vary that max # segments.
	Say during the day you optimize down to 15 segments every time you
	update the index, but then at night you want to optimize down to 5.
	If we don't add method to IndexWriter you then must have instance var
	on your MergePolicy that you set, then you call optimize.  It's not
	clean since really it should be a parameter.

Well, I don't know if I buy the argument that it should be a parameter. The merge policy has
lots of state like docs/seg. I don't really see why segs/optimize is different.

My main reason for not wanting put this into IndexWriter is then every merge policy must support
it.

	Wait: there is a barrier, right?  In IndexWriter.replace we don't do
	the right thing with non-contiguous merges?

Yeah, I meant that I'm not aware of any barriers except fixing IndexWriter#replace, in other
words, I'm not aware of any other places where non-contiguity would cause a failure.

	I'm not wed to "maybeMerge()" but I really don't like "merge" :)  It's
	overloaded now.

	EG IndexMerger interface has a method called "merge" that is named
	correctly because it will specifically go a do the requested merge.
	So does IndexWriter.

	Then, you have other [overloaded] methods in LogDocMergePolicy called
	"merge" that are named appropriately (they will do a specific merge).

	How about "checkForMerges()"?

I don't find it ambiguous based on class and argument type. It's all personal, of course.

I'd definitely prefer maybe over checkFor because that sounds like a predicate.

	Maybe we could add a "setMergePolicy(MergePolicy policy, boolean
	doClose)" and default doClose to true?

That sounds good.

	Finally: does MergePolicy really need a close()?

I think so. The concurrent merge policy maintains all sorts of state.

	I don't see how it can work (building the generic concurrency wrapper
	"under" IndexMerger) because the MergePolicy is in "serial control",
	eg, when it wants to cascade merges.  How will you return that thread
	back to IndexWriter?

So this is how it looks now: the concurrent merge policy is both a merge policy and an index
merger. The serial merge policy knows nothing about it other than it does not get IndexWriter
as its merge.

The index writer wants its merge, so it does it merge/maybeMerge call on the concurrent merge
policy. The CMP calls merge on the serial policy, but substitutes itself for the merger rather
than IndexWriter.

The serial merge policy goes on its merry way, looking for merges to do (in the current model,
this is a loop; more on that in a minute). Each time it has a subset of segments to merge,
it calls merger.merge(...).

At this point, the concurrent merge policy takes over again. It looks at the segments to be
merged and other segments being processed by all existing merge threads and determines if
there's a conflict (a request to merge a segment that's currently in a merge). If there's
no conflict, it starts a merge thread and calls IndexWriter#merge on the thread. The original
calling thread returns immediately. (I have a few ideas how to handle conflicts, the simplest
of which is to wait for the conflicting merge and the restart the serial merge, e.g., revert
to serial).

This seems to work pretty well, so far. The only difference in API for the serial merges is
that the merge operation can't return the number of documents in the result (since it isn't
known how many docs will be deleted).   

	With concurrency wrapper "on the top" it's able to easily take a merge
	request as returned by the policy, kick it off in the backrground, and
	immediately return control of original thread back to IndexWriter.

What I don't know how to do with this is figure out how to do a bunch of merges. Lets say
I have two levels in LogDoc that are merge worthy. If I call LogDoc, it'll return the lower
level. That's all good. But what about doing the higher level in parallel? If I call LogDoc
again, it's going to return the lower level again because it knows nothing about the current
merge going on.

LogDoc already does things in a loop: it's pretty much set up to call all possible merges
at one time (if they return immediately).

It would be possible to have it return a vector of segmentInfo subsets, but I don't see the
gain (and it doesn't work out as well for my putative conflict resolution).

	have all necessary locking for SegmentInfos inside
	IndexWriter

This was a red-herring on my part. All the "segmentInfos locking" has always been in IndexWriter.
That's note exactly sufficient. The fundamental issue is that IndexWriter#merge has to operate
without a lock on IndexWriter. At some point, I was thinking that meant it would have to lock
SegmentInfos but that's ludicrous, actually. It's sufficient for IndexWriter#replace to be
synchronized.

	If CMP implements IndexMerger you must have locking inside any
	MergePolicy that's calling into CMP?

No. CMP does it's own locking (for purposes of thread management) but the serial merge policies
no nothing of this (and they can expect to be called synchronously).

> Factor merge policy out of IndexWriter
> --------------------------------------
>
>                 Key: LUCENE-847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-847
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, making it
possible for apps to choose a custom merge policy and for easier experimenting with merge
policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message