lucene-dev mailing list archives

From "Ning Li" <>
Subject Re: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
Date Wed, 08 Aug 2007 13:48:25 GMT
On 8/7/07, Steven Parkes (JIRA) <> wrote:
> Steven Parkes commented on LUCENE-847:
> --------------------------------------
>     Is the separate IndexMerger interface really necessary?
> I wrestled with this, so in the past, it wouldn't have taken much to convince me otherwise.
The reason for the extra interface is that I was hoping to discourage merge policies from
looking too deeply into the internals of IndexWriter, or at least to add a little extra friction to doing so.
> As an example, it's not a good idea for merge policies to be able to access IndexWriter#segmentInfos
directly. (That's a case I would like to solve by making segmentInfos private, but I haven't
looked at that completely yet and even beyond that case, I'd still like merge policies to
have a very clean interface with IWs.)
> It feels kinda weird for me to be arguing for constraints since I work mostly in dynamic
languages that have none of this stuff. But it reflects my goal that merge policies simply
be algorithms, not real workers.
> Moreover, I think it may be useful for implementing concurrency (see below).
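To make the separation concrete, here is a minimal sketch of the idea (the names are illustrative only, not the actual LUCENE-847 patch): the policy is handed a narrow IndexMerger view rather than the whole IndexWriter, so it can request merges but cannot reach segmentInfos or other writer internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names do not match the real patch.
interface IndexMerger {
    void merge(List<String> segmentNames);  // request a merge of these segments
    String getDirectory();                  // where merged segments will live
}

interface MergePolicy {
    // The policy is just an algorithm: it inspects segments and asks
    // the merger to act; it never touches writer internals.
    void maybeMerge(List<String> segmentNames, IndexMerger merger);
}

class SketchWriter implements IndexMerger {
    final List<String> merged = new ArrayList<>();
    public void merge(List<String> segs) { merged.addAll(segs); }
    public String getDirectory() { return "dir0"; }

    public static void main(String[] args) {
        SketchWriter w = new SketchWriter();
        MergePolicy p = (segs, m) -> { if (segs.size() > 2) m.merge(segs); };
        p.maybeMerge(Arrays.asList("_0", "_1", "_2"), w);
        System.out.println(w.merged);  // [_0, _1, _2]
    }
}
```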
>     While LogDocMergePolicy does need "maxBufferedDocs", I think,
>     instead, in IndexWriter's "setMaxBufferedDocs()" it should "write
>     through" to the LogDocMergePolicy if that is the merge policy in
>     use (and LogDocMergePolicy should store its own private
>     "minMergeDocs").
> The thing here is that the merge policy has nothing to do with max buffered docs, right?
The code for buffering docs for the first segment is wholly in the IndexWriter. LogDocMergePolicy
happens to need it (in its current incarnation) in order to handle the way the log levels
are computed. This could, of course, be changed. There's nothing that says a merge policy
has to look at these values, just that they're available should the merge policy want to look.
> I guess my idea was that the index writer was responsible for handling the initial segment
(with DocsWriter, if it wants) and also for giving the merge policies an indication of how
it was handling them.
> It's possible to have the merge policy influence the first segment size but that opens
up a lot of issues. Does every merge policy then have to know how to trade between buffering
by doc count and buffering by ram? I was hoping to avoid that.
> I'm not all that happy with the way this is, but supporting both by-doc and by-ram
buffering, while messy, seems necessary. This was my take on making it least messy.
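A minimal sketch of the "write through" being suggested (class and field names are illustrative): buffering stays IndexWriter's job, but setMaxBufferedDocs forwards the value when the policy in use happens to be a LogDocMergePolicy, which keeps its own private copy.

```java
// Illustrative sketch only, not the actual Lucene code.
class LogDocMergePolicy {
    int minMergeDocs = 1000;  // the policy's own private copy
}

class IndexWriterSketch {
    Object mergePolicy = new LogDocMergePolicy();
    int maxBufferedDocs = 10;

    void setMaxBufferedDocs(int n) {
        maxBufferedDocs = n;  // IndexWriter still owns initial-segment buffering
        if (mergePolicy instanceof LogDocMergePolicy) {
            // "write through" to the policy only when it cares
            ((LogDocMergePolicy) mergePolicy).minMergeDocs = n;
        }
    }

    public static void main(String[] args) {
        IndexWriterSketch iw = new IndexWriterSketch();
        iw.setMaxBufferedDocs(50);
        System.out.println(((LogDocMergePolicy) iw.mergePolicy).minMergeDocs);  // 50
    }
}
```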
>     In LogDocMergePolicy, it seems like the checking that's done for
>     whether a SegmentInfo is in a different directory, as well as the
>     subsequent copy to move it over to the IndexWriter's directory,
>     should not live in the MergePolicy?
> I see two parts to this.
> First, I hesitate to make it impossible for a merge policy to see the directory information,
i.e., to remove IndexMerger#getDirectory(). Copying a segment is one way to get it into the current
directory, but merging with other segments does too. A merge policy could decide to go ahead
and merge a bunch of segments that are in other directories rather than just copy them individually.
Taking away getDirectory() removes that option.
> As to how to handle the case where single segments are copied, I guess my main reason
for leaving that in the merge policy would be for simplicity. Seems nicer to have all segment
amalgamation management in one place, rather than some in one class and some in another. Could
be factored into an optional base merge policy for derived classes to use as they might like.
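A sketch of what that option looks like (hypothetical names, doc-free toy classes): with getDirectory() visible, a policy can spot segments living outside the writer's directory and choose to merge them as a group, bringing them local in one pass, instead of copying each one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; Segment stands in for SegmentInfo.
class Segment {
    final String name, dir;
    Segment(String name, String dir) { this.name = name; this.dir = dir; }
}

class DirAwarePolicy {
    // Collect the segments that are not in the writer's directory; a
    // policy could then merge this whole group rather than copy each.
    static List<String> foreignSegments(List<Segment> segs, String writerDir) {
        List<String> out = new ArrayList<>();
        for (Segment s : segs)
            if (!s.dir.equals(writerDir)) out.add(s.name);
        return out;
    }

    public static void main(String[] args) {
        List<Segment> segs = Arrays.asList(
            new Segment("_0", "dir0"), new Segment("_1", "other"),
            new Segment("_2", "other"));
        System.out.println(foreignSegments(segs, "dir0"));  // [_1, _2]
    }
}
```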
>     The "checkOptimize" method in LogDocMergePolicy seems like it
>     belongs back in IndexWriter: I think we don't want every
>     MergePolicy having to replicate that tricky while condition.
> The reason for not doing this was I can imagine different merge policies having a different
model of what optimize means. I can imagine some policies that would not optimize everything
down to a single segment but instead obeyed a max segment size. But we could factor the default
conditional into an optional base class, as above.
> It is an ugly conditional that there might be a better way to formulate, so that policies
don't have to look at the grody details like hasSeparateNorms. But I'd still like the merge
policies to be able to decide what optimize means at a high level.
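The "optional base class" idea might look like this sketch (names are hypothetical, and the real condition would also have to consult details like hasSeparateNorms): the default test means "one segment", but a policy can redefine optimize as "at most N segments".

```java
// Illustrative sketch only.
abstract class BaseMergePolicy {
    // Default meaning of optimize: a single segment.
    boolean isOptimized(int numSegments) { return numSegments <= 1; }
}

class MaxSegmentsMergePolicy extends BaseMergePolicy {
    final int maxSegments;
    MaxSegmentsMergePolicy(int max) { this.maxSegments = max; }

    // This policy's notion of optimize: at most maxSegments segments.
    @Override boolean isOptimized(int numSegments) {
        return numSegments <= maxSegments;
    }

    public static void main(String[] args) {
        BaseMergePolicy p = new MaxSegmentsMergePolicy(4);
        System.out.println(p.isOptimized(3));  // true
        System.out.println(p.isOptimized(5));  // false
    }
}
```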
>     Maybe we could change MergePolicy.merge to take a boolean "forced"
>     which, if true, means that the MergePolicy must now pick a merge
>     even if it didn't think it was time to.  This would let us move
>     that tricky while condition logic back into IndexWriter.
> I didn't follow this. forced == optimize? If not, what does "pick a merge" mean? I'm not sure
what LogDoc should do for merge(force=true) if it thinks everything is fine.
>     Also, I think at some point we may want to have an optimize()
>     method that takes an int parameter stating the max # segments in
>     the resulting index.
> I think this is great functionality for a merge policy, but what about just making it
part of the individual merge policy's interface, rather than linking it to IndexWriter? That was
one goal of factoring out the merge policy: that you could make these tweaks without changing
IndexWriter itself.
>     This would allow you to optimize down to <=
>     N segments w/o paying full cost of a complete "only one segment"
>     optimize.  If we had a "forced" boolean then we could build such
>     an optimize method in the future, whereas as it stands now it
>     would not be so easy to retrofit previously created MergePolicy
>     classes to do this.
> I haven't looked at how difficult it would be to make LogDoc sufficiently derivable to
allow a derived class to add this tweak. If I could, would it be enough?
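As a sanity check that such a tweak is expressible at all, here is one way a derived policy might implement "optimize down to at most N segments". Everything is illustrative: doc counts stand in for segments, and the merge-smallest-pair strategy is just one possible choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: segments modeled as doc counts.
class NSegmentOptimizer {
    // Repeatedly merge the two smallest segments until at most
    // maxSegments remain; returns the resulting segment sizes.
    static List<Integer> optimize(List<Integer> segmentDocCounts, int maxSegments) {
        List<Integer> segs = new ArrayList<>(segmentDocCounts);
        while (segs.size() > maxSegments) {
            Collections.sort(segs);
            int merged = segs.remove(0) + segs.remove(0);
            segs.add(merged);
        }
        return segs;
    }

    public static void main(String[] args) {
        System.out.println(optimize(Arrays.asList(10, 20, 30, 40, 50), 2));  // [60, 90]
    }
}
```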
>     There are some minor things that should not be committed eg the
>     added "infoStream = null" in IndexFileDeleter.  I typically try to
>     put a comment "// nocommit" above such changes as I make them to
>     remind myself and keep them from slipping in.
> You're right and thanks for the idea. Obvious now.
>     In the deprecated IndexWriter methods you're doing a hard cast to
>     LogDocMergePolicy which gives a bad result if you're using a
>     different merge policy; maybe catch the class cast exception (or,
>     better, check upfront if it's an instanceof) and raise a more
>     reasonable exception if not?
> Agreed.
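The agreed-on guard might look like this sketch (the writer and policy classes here are stand-ins, and the method name is illustrative): check instanceof up front and raise a clearer exception than a raw ClassCastException.

```java
// Illustrative sketch only.
class LogDocMergePolicy { int minMergeDocs; }
class OtherPolicy { }

class WriterSketch {
    Object mergePolicy = new OtherPolicy();

    // Deprecated-style setter that only makes sense for LogDocMergePolicy.
    void setMinMergeDocs(int n) {
        if (!(mergePolicy instanceof LogDocMergePolicy)) {
            throw new IllegalStateException(
                "setMinMergeDocs requires LogDocMergePolicy but merge policy is "
                + mergePolicy.getClass().getSimpleName());
        }
        ((LogDocMergePolicy) mergePolicy).minMergeDocs = n;
    }

    public static void main(String[] args) {
        WriterSketch w = new WriterSketch();
        try {
            w.setMinMergeDocs(100);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());  // a clear, actionable message
        }
    }
}
```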
>     IndexWriter around line 1908 has a for loop with a commented-out
>     "System.err.println"; we should just comment out/remove the for loop
> The whole loop will be gone before commit, but I didn't want to delete it yet since I
might need to turn it back on for more debugging. It should, of course, have a "// nocommit"
comment.
>     These commented out synchronized spook me a bit:
>       /* synchronized(segmentInfos) */ {
> This is from my old code. I left it in there as a hint as I work on the concurrent stuff
again. It's only a weak hint, though, because things have changed a lot since that code was
even partially functional. Ignore it. It won't go into the serial patch and anything for LUCENE-870
will have to have its own justification.
>     Can we support non-contiguous merges?  If I understand it
>     correctly, the MergeSpecification can express such a merge, it's
>     just that the current IndexMerger (IndexWriter) cannot execute it,
>     right?  So we are at least not precluding fixing this in the
>     future.
> As far as I've seen so far, there are no barriers to non-contiguous merges. Maybe something
will crop up, but in what I've done, I haven't seen any.
>     It confuses me that MergePolicy has a method "merge(...)" -- can
>     we rename it to "maybeMerge(..)" or "checkForMerge(...)"?
> I suppose. I'm not a big fan of the "maybeFoo" style of naming. I think of "merge" like
"optimize": make it so / idempotent. But I'm certainly willing to write whatever people find
clearest.
>     Instead of IndexWriter.releaseMergePolicy() can we have
>     IndexWriter only close the merge policy if it was the one that had
>     created it?  (Similar to how IndexWriter closes the dir if it has
>     opened it from a String or File, but does not close it if it was
>     passed in).
> This precludes:
>
>     iw.setMergePolicy(new MyMergePolicy(...));
>     ...
>     iw.close();
>
> You're always going to have to:
>
>     MergePolicy mp = new MyMergePolicy(...);
>     iw.setMergePolicy(mp);
>     ...
>     iw.close();
>     mp.close();
> The implementation's much cleaner using the semantics you describe, but I was thinking
it'd be better to optimize for the usability of the common client code case?
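A sketch of the "close it only if we created it" semantics under discussion (toy classes, illustrative names): the writer tracks whether it instantiated the default policy, and close() releases only what it owns, which is what forces the caller-owned pattern above.

```java
// Illustrative sketch only.
class OwnedMergePolicy implements AutoCloseable {
    boolean closed = false;
    public void close() { closed = true; }
}

class OwningWriter implements AutoCloseable {
    OwnedMergePolicy mergePolicy = new OwnedMergePolicy();  // default, writer-owned
    boolean ownsPolicy = true;

    void setMergePolicy(OwnedMergePolicy mp) {
        mergePolicy = mp;
        ownsPolicy = false;  // the caller created it, so the caller closes it
    }

    public void close() {
        if (ownsPolicy) mergePolicy.close();
    }

    public static void main(String[] args) {
        OwningWriter iw = new OwningWriter();
        OwnedMergePolicy mine = new OwnedMergePolicy();
        iw.setMergePolicy(mine);
        iw.close();
        System.out.println(mine.closed);  // false: still the caller's job
        mine.close();
        System.out.println(mine.closed);  // true
    }
}
```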
>         Well I think the current MergePolicy API (where the "merge" method
>         calls IndexWriter.merge itself, must cascade itself, etc.) makes it
>         hard to build a generic ConcurrentMergePolicy "wrapper" that you could
>         use to make any MergePolicy concurrent (?).  How would you do it?
> I really haven't had time to go heads down on this (the old concurrent merge policy was
a derived class rather than a wrapper class). But I was thinking that perhaps ConcurrentMergePolicy
would actually wrap IndexWriter as well as the serial merge policy, i.e., implement IndexMerger
(my biggest argument for IM at this point). I haven't looked deeply at whether this will
work, but I think it has at least a chance.
> I should know more about this in a day or two.
>         I think you can still have state (as instance variables in your
>         class)?  How would this simplification restrict the space of merge
>         policies?
> s/state/stack state/. Yeah, you can always unwind your loops and lift your recursions,
put all that stack state into instance variables. But, well, yuck. I'd like to make it easy
to write simple merge policies and take up the heavy lifting elsewhere. Hopefully there will
be more merge policies than index writers.
>         Oh, no: ConcurrentMergePolicy would still call IndexWriter.merge(spec),
>         just with a separate thread.  And so all synchronization required is
>         still inside IndexWriter (I think?).
> That's my idea.
> The synchronization is still tricky, since parts of segmentInfos are getting changed
at various times and there are references and/or copies of it other places. And as Ning pointed
out to me, we also have to deal with buffered delete terms. I'd say I got about 80% of the
way there on the last go around. I'm hoping to get all the way this time.
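The thread-per-merge idea can be sketched as follows (toy classes, hypothetical names; a real wrapper would track its threads rather than join immediately): the concurrent side only hands the chosen spec to the writer's merge() on a background thread, so all locking stays inside the writer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only.
class MergingWriter {
    // Synchronization lives here, inside the writer, not in the policy.
    final List<String> mergedSpecs =
        Collections.synchronizedList(new ArrayList<>());
    void merge(String spec) { mergedSpecs.add(spec); }
}

class ConcurrentMergeRunner {
    // Run the writer's merge on a separate thread; join immediately here
    // just to keep the sketch deterministic.
    static void mergeInBackground(MergingWriter w, String spec) {
        Thread t = new Thread(() -> w.merge(spec));
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        MergingWriter w = new MergingWriter();
        mergeInBackground(w, "merge _0+_1");
        System.out.println(w.mergedSpecs);  // [merge _0+_1]
    }
}
```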
>         In fact, if we stick with the current MergePolicy API, aren't you
>         going to have to put some locking into eg the LogDocMergePolicy when
>         concurrent merges might be happening?
> Yes and no. If CMP implements IndexMerger, I think we might be okay? In the previous
iteration, I used derivation so that ConcurrentLogDocMergePolicy derived from the serial version
but had the necessary threading. I agree that a wrapper is a better solution if it can be made
to work.
> > Factor merge policy out of IndexWriter
> > --------------------------------------
> >
> >                 Key: LUCENE-847
> >                 URL:
> >             Project: Lucene - Java
> >          Issue Type: Improvement
> >          Components: Index
> >            Reporter: Steven Parkes
> >            Assignee: Steven Parkes
> >         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, LUCENE-847.txt
> >
> >
> > If we factor the merge policy out of IndexWriter, we can make it pluggable, making
it possible for apps to choose a custom merge policy and for easier experimenting with merge
policy variants.
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
