lucene-dev mailing list archives

From "Steven Parkes (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
Date Tue, 07 Aug 2007 17:55:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518210 ]

Steven Parkes commented on LUCENE-847:
--------------------------------------

    Is the separate IndexMerger interface really necessary?

I wrestled with this, so it wouldn't take much to convince me otherwise. The reason for the
extra interface is that I was hoping to create a little extra friction that discourages merge
policies from looking too deeply into the internals of IndexWriter.

As an example, it's not a good idea for merge policies to be able to access IndexWriter#segmentInfos
directly. (That's a case I would like to solve by making segmentInfos private, but I haven't
looked at that completely yet, and even beyond that case, I'd still like merge policies to
have a very clean interface with IndexWriter.)

It feels kinda weird for me to be arguing for constraints since I work mostly in dynamic languages
that have none of this stuff. But it reflects my goal that merge policies simply be algorithms,
not real workers.

Moreover, I think it may be useful for implementing concurrency (see below).
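
For illustration, something like this is the shape I have in mind (a rough sketch, not the
patch code; assume the usual org.apache.lucene.index/store types and java.io.IOException, and
that MergeSpecification comes from the patch):

	// Sketch: the narrow surface a merge policy sees, instead of all of
	// IndexWriter. Names beyond getDirectory()/merge(spec) are illustrative.
	public interface IndexMerger {
	  /** The directory the index lives in, so a policy can spot foreign segments. */
	  Directory getDirectory();

	  /** Carry out a merge the policy has decided on. */
	  void merge(MergePolicy.MergeSpecification spec) throws IOException;
	}

	// A merge policy is then "just an algorithm" over the segments it is shown:
	public abstract class MergePolicy {
	  /** What to merge; in the patch this can describe non-contiguous segments. */
	  public static class MergeSpecification { /* elided */ }

	  public abstract void merge(SegmentInfos segmentInfos, IndexMerger merger)
	      throws IOException;
	  public abstract void close();
	}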

    While LogDocMergePolicy does need "maxBufferedDocs", I think,
    instead, in IndexWriter's "setMaxBufferedDocs()" it should "write
    through" to the LogDocMergePolicy if that is the merge policy in
    use (and LogDocMergePolicy should store its own private
    "minMergeDocs").

The thing here is that the merge policy has nothing to do with max buffered docs, right? The
code for buffering docs for the first segment is wholly in the IndexWriter. LogDocMergePolicy
happens to need it (in its current incarnation) in order to handle the way the log levels
are computed. This could, of course, be changed. There's nothing that says a merge policy
has to look at these values, just that they're available should the merge policy want to look.

I guess my idea was that the IndexWriter was responsible for handling the initial segment
(with DocsWriter, if it wants) and also for giving the merge policies an indication of how
it was doing that.

It's possible to have the merge policy influence the first segment size but that opens up
a lot of issues. Does every merge policy then have to know how to trade between buffering
by doc count and buffering by ram? I was hoping to avoid that.

I'm not all that happy with the way this is, but supporting both by-doc and by-ram buffering
is messy and seems necessary. This was my take on making it the least messy.
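
To make the dependency concrete: the only thing LogDoc really wants from maxBufferedDocs is a
baseline for computing levels. Very roughly (a simplified sketch, not the patch code; the real
logic handles cases this ignores):

	// Simplified sketch: a segment's "level" is its doc count expressed as a
	// power of mergeFactor relative to the size the writer flushes (the by-doc
	// buffer size). Freshly flushed segments sit at level 0; each
	// factor-of-mergeFactor growth bumps the level by one.
	class LogLevelSketch {
	  static int level(int docCount, int minMergeDocs, int mergeFactor) {
	    double size = Math.max(docCount, 1);
	    double base = Math.max(minMergeDocs, 1);
	    return (int) Math.floor(Math.log(size / base) / Math.log(mergeFactor));
	  }

	  public static void main(String[] args) {
	    int minMergeDocs = 1000;  // would mirror IndexWriter's maxBufferedDocs
	    System.out.println(level(1000, minMergeDocs, 10));    // 0: freshly flushed
	    System.out.println(level(100000, minMergeDocs, 10));  // 2: two generations later
	  }
	}

If the writer is buffering by ram instead, there's no obvious doc-count baseline, which is
exactly the messiness above.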

    In LogDocMergePolicy, it seems like the checking that's done for
    whether a SegmentInfo is in a different directory, as well as the
    subsequent copy to move it over to the IndexWriter's directory,
    should not live in the MergePolicy?

I see two parts to this.

First, I hesitate to make it impossible for a merge policy to see the directory information,
i.e., to remove IndexMerger#getDirectory(). Copying a segment is one way to get it into the current
directory, but merging it with other segments does too. A merge policy could decide to go ahead
and merge a bunch of segments that are in other directories rather than just copy them individually.
Taking away getDirectory() removes that option.

As to how to handle the case where single segments are copied, I guess my main reason for
leaving that in the merge policy would be for simplicity. Seems nicer to have all segment
amalgamation management in one place, rather than some in one class and some in another. Could
be factored into an optional base merge policy for derived classes to use as they might like.
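
For concreteness, the check in question is essentially just this (hedged sketch; assumes the
policy can see SegmentInfo.dir and has the IndexMerger handle):

	// Sketch: is this segment in some directory other than the writer's?
	// If so, the policy can copy it over by itself or just fold it into a
	// bigger merge with other segments.
	boolean isForeign(SegmentInfo info, IndexMerger merger) {
	  return info.dir != merger.getDirectory();
	}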

    The "checkOptimize" method in LogDocMergePolicy seems like it
    belongs back in IndexWriter: I think we don't want every
    MergePolicy having to replicate that tricky while condition.

The reason for not doing this was that I can imagine different merge policies having different
models of what optimize means. I can imagine some policies that would not optimize everything
down to a single segment but would instead obey a max segment size. But we could factor the default
conditional into an optional base class, as above.

It is an ugly conditional, and there might be a better way to formulate it so that policies don't
have to look at grody details like hasSeparateNorms. But I'd still like the merge policies
to be able to decide, at a high level, what optimize means.
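
As a sketch of what that optional base class might look like (simplified; the actual checks in
IndexWriter are hairier, and a max-segment-size policy would simply override this):

	// Sketch: an optional base class owns the default "already optimized?"
	// decision, so most policies never see details like hasSeparateNorms, but
	// a policy with a different notion of optimize can override it.
	public abstract class BaseMergePolicy extends MergePolicy {
	  protected boolean isOptimized(SegmentInfos infos, IndexMerger merger)
	      throws IOException {
	    if (infos.size() > 1)
	      return false;
	    if (infos.size() == 0)
	      return true;
	    SegmentInfo info = infos.info(0);
	    // A lone segment still isn't optimized if it has deletions or
	    // separate norms, or if it lives in a different directory.
	    return !info.hasDeletions()
	        && !info.hasSeparateNorms()
	        && info.dir == merger.getDirectory();
	  }
	}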

    Maybe we could change MergePolicy.merge to take a boolean "forced"
    which, if true, means that the MergePolicy must now pick a merge
    even if it didn't think it was time to.  This would let us move
    that tricky while condition logic back into IndexWriter.

I didn't follow this. forced == optimize? If not, what does pick a merge mean? Not sure what
LogDoc should do for merge(force=true) if it thinks everything is fine?

    Also, I think at some point we may want to have an optimize()
    method that takes an int parameter stating the max # segments in
    the resulting index.

I think this is great functionality for a merge policy, but what about just making that part
of the individual merge policy interface, rather than linked to IndexWriter? That was one
goal of making a factored merge policy: that you could do these tweaks without changing IndexWriter.

    This would allow you to optimize down to <=
    N segments w/o paying full cost of a complete "only one segment"
    optimize.  If we had a "forced" boolean then we could build such
    an optimize method in the future, whereas as it stands now it
    would not be so easy to retrofit previously created MergePolicy
    classes to do this.

I haven't looked at how difficult it would be to make LogDoc sufficiently derivable to allow
a derived class to add this tweak. If I could, would it be enough?
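
If it could, the tweak might be as small as this (purely hypothetical; it reuses the isOptimized
hook from the base-class sketch above, which may not be the seam LogDoc actually exposes):

	// Hypothetical: optimize down to at most N segments as a derived-policy
	// tweak, with no new IndexWriter API.
	public class BoundedOptimizePolicy extends LogDocMergePolicy {
	  private final int maxNumSegments;

	  public BoundedOptimizePolicy(int maxNumSegments) {
	    this.maxNumSegments = maxNumSegments;
	  }

	  protected boolean isOptimized(SegmentInfos infos, IndexMerger merger) {
	    // Stop cascading optimize merges once we're under the cap.
	    return infos.size() <= maxNumSegments;
	  }
	}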

    There are some minor things that should not be committed eg the
    added "infoStream = null" in IndexFileDeleter.  I typically try to
    put a comment "// nocommit" above such changes as I make them to
    remind myself and keep them from slipping in.

You're right and thanks for the idea. Obvious now.

    In the deprecated IndexWriter methods you're doing a hard cast to
    LogDocMergePolicy which gives a bad result if you're using a
    different merge policy; maybe catch the class cast exception (or,
    better, check upfront if it's an instanceof) and raise a more
    reasonable exception if not?

Agreed.
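
Something like this, presumably (sketch; assumes a getMergePolicy() accessor on IndexWriter):

	// Sketch: the deprecated setter checks the policy type up front instead
	// of hard-casting, and fails with a clearer message otherwise.
	public void setMaxMergeDocs(int maxMergeDocs) {
	  MergePolicy mp = getMergePolicy();
	  if (!(mp instanceof LogDocMergePolicy))
	    throw new IllegalArgumentException(
	        "this method requires LogDocMergePolicy; current merge policy is "
	        + mp.getClass().getName());
	  ((LogDocMergePolicy) mp).setMaxMergeDocs(maxMergeDocs);
	}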

    IndexWriter around line 1908 has for loop that has commented out
    "System.err.println"; we should just comment out/remove for loop

The whole loop will be gone before commit but I didn't want to delete it yet since I might
need to turn it back on for more debugging.  It should, of course, have a "// nocommit" comment.

    These commented out synchronized spook me a bit:

      /* synchronized(segmentInfos) */ {

This is from my old code. I left it in there as a hint for when I work on the concurrent stuff
again. It's only a weak hint, though, because things have changed a lot since that code was even
partially functional. Ignore it. It won't go into the serial patch, and anything for LUCENE-870
will have to have its own justification.

    Can we support non-contiguous merges?  If I understand it
    correctly, the MergeSpecification can express such a merge, it's
    just that the current IndexMerger (IndexWriter) cannot execute it,
    right?  So we are at least not precluding fixing this in the
    future.

As far as I've seen, there are no barriers to non-contiguous merges. Maybe something
will crop up, but in what I've done so far, I haven't seen any.

    It confuses me that MergePolicy has a method "merge(...)" -- can
    we rename it to "maybeMerge(..)" or "checkForMerge(...)"?

I suppose. I'm not a big fan of the "maybeFoo" style of naming. I think of "merge" like "optimize":
make it so / idempotent. But I'm certainly willing to write whatever people find clearest.


    Instead of IndexWriter.releaseMergePolicy() can we have
    IndexWriter only close the merge policy if it was the one that had
    created it?  (Similar to how IndexWriter closes the dir if it has
    opened it from a String or File, but does not close it if it was
    passed in).

This precludes

	iw.setMergePolicy(new MyMergePolicy(...));
	...
	iw.close();

You're always going to have to

	MergePolicy mp = new MyMergePolicy(...);
	iw.setMergePolicy(mp);
	...
	iw.close();
	mp.close();

The implementation's much cleaner using the semantics you describe, but I was thinking it'd
be better to optimize for the usability of the common client code case?
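
For contrast, the ownership semantics you're describing would look roughly like this inside
IndexWriter (sketch; the ownership flag and helper are invented):

	// Sketch: IndexWriter closes the merge policy only if it created it,
	// analogous to how it treats a Directory it opened itself.
	private MergePolicy mergePolicy = new LogDocMergePolicy();
	private boolean ownsMergePolicy = true;   // we created the default one

	public void setMergePolicy(MergePolicy mp) {
	  if (ownsMergePolicy)
	    mergePolicy.close();
	  mergePolicy = mp;
	  ownsMergePolicy = false;                // caller created it, caller closes it
	}

	private void closeMergePolicy() {         // called from close()
	  if (ownsMergePolicy)
	    mergePolicy.close();
	}

It reads cleanly on the writer side; the cost is that, in the inline setMergePolicy(new
MyMergePolicy(...)) case above, nothing ever closes the policy unless the caller keeps a
reference around.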
	
	Well I think the current MergePolicy API (where the "merge" method
	calls IndexWriter.merge itself, must cascade itself, etc.) makes it
	hard to build a generic ConcurrentMergePolicy "wrapper" that you could
	use to make any MergePolicy concurrent (?).  How would you do it?

I really haven't had time to go heads-down on this (the old concurrent merge policy was a
derived class rather than a wrapper class). But I was thinking that perhaps ConcurrentMergePolicy
would actually wrap IndexWriter as well as the serial merge policy, i.e., implement IndexMerger
(my biggest argument for IM at this point). I haven't looked deeply at whether this will
work, but I think it has at least a chance.

I should know more about this in a day or two.

	I think you can still have state (as instance variables in your
	class)?  How would this simplification restrict the space of merge
	policies?

s/state/stack state/. Yeah, you can always unwind your loops and lift your recursions, put
all that stack state into instance variables. But, well, yuck. I'd like to make it easy to
write simple merge policies and take up the heavy lifting elsewhere. Hopefully there will
be more merge policies than index writers.

	Oh, no: ConcurrentMergePolicy would still call IndexWriter.merge(spec),
	just with a separate thread.  And so all synchronization required is
	still inside IndexWriter (I think?).

That's my idea.

The synchronization is still tricky, since parts of segmentInfos are getting changed at various
times and there are references and/or copies of it in other places. And as Ning pointed out to
me, we also have to deal with buffered delete terms. I'd say I got about 80% of the way there
on the last go-round. I'm hoping to get all the way this time.

	In fact, if we stick with the current MergePolicy API, aren't you
	going to have to put some locking into eg the LogDocMergePolicy when
	concurrent merges might be happening?

Yes and no. If CMP implements IndexMerger, I think we might be okay? In the previous iteration,
I used derivation, so that ConcurrentLogDocMergePolicy derived from the serial version but
had the necessary threading. I agree that a wrapper is a better solution if it can be made to
work.
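
To make the wrapper shape concrete, roughly what I'm picturing (very much a sketch; shutdown,
error handling, and the segmentInfos/buffered-delete-terms synchronization discussed above are
all elided):

	// Sketch: ConcurrentMergePolicy wraps both the serial policy and the
	// writer. It hands itself to the wrapped policy as the IndexMerger, and
	// runs whatever merges that policy picks on a background thread, so the
	// wrapped policy itself needs no locking.
	public class ConcurrentMergePolicy extends MergePolicy implements IndexMerger {
	  private final MergePolicy wrapped;
	  private final IndexMerger writer;   // the real IndexWriter

	  public ConcurrentMergePolicy(MergePolicy wrapped, IndexMerger writer) {
	    this.wrapped = wrapped;
	    this.writer = writer;
	  }

	  public void merge(SegmentInfos infos, IndexMerger ignored) throws IOException {
	    // The wrapped policy thinks it's talking to the writer but gets us.
	    wrapped.merge(infos, this);
	  }

	  public Directory getDirectory() {
	    return writer.getDirectory();
	  }

	  public void merge(final MergeSpecification spec) {
	    new Thread() {
	      public void run() {
	        try {
	          writer.merge(spec);   // required locking stays inside IndexWriter
	        } catch (IOException e) {
	          throw new RuntimeException(e);
	        }
	      }
	    }.start();
	  }

	  public void close() {
	    wrapped.close();
	  }
	}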

> Factor merge policy out of IndexWriter
> --------------------------------------
>
>                 Key: LUCENE-847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-847
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, making it
> possible for apps to choose a custom merge policy and for easier experimenting with merge
> policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

