lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2755) Some improvements to CMS
Date Wed, 24 Nov 2010 14:04:15 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935327#action_12935327
] 

Shai Erera commented on LUCENE-2755:
------------------------------------

There are several points addressed by this issue:
* Refactor IW, MS and MP so that MS pulls merges directly from MP, instead from IW.
* Rewrite CMS to take advantage of ThreadPoolExecutor instead of managing the threads on its
own, in addition to using a blocking queue instead of us coding the blocking directly.
* One should be able to reuse CMS across several IndexWriters, which is not possible today,
to e.g., allow one controlling the total # of merges happening in the JVM.
* Merges should be sorted by their size in bytes and not by their # of docs -- or actually,
merges should be sorted by a criteria the MP determines.

All the while maintaining the following requirements, in no particular order:
* MergeThreads' priority needs to be controllable - this is the current behavior of CMS that
we'd like to keep.
* When there are too many merges to execute, small ones should be preferred to large ones,
and we need the ability to pause large merges in favor of small ones.
* The user needs to be able to control:
*# The max number of running merges
*# The max number of merges, above which scheduling more merges should be blocked.
* We should keep the sync() API, which lets the user wait for all scheduled merges to complete.
* The MP needs to be aware of the type of merges that are requested (regular, optimize, expunge).
* The user should be able to fast-close the index, aborting all merges (running and pending).
* If there are cascading merges (i.e., a result of several other merges), they should all
be executed following the call to MS.merge() -- that is, it could be that CMS itself, or its
MergeThreads will encounter merges not returned by MP at first, but as a subsequent round
due to changes done to the index.

After investigating the code and going over the proposed plan, I feel that we cannot accomplish
all that we'd like to do, given the above requirements. And just to be clear, those are not
*any* application requirements, but the default ones we'd like Lucene to offer OOtB. One can
still write a MS/MP which doesn't guarantee all that.

Using ThreadPoolExecutor looks like will only complicate CMS instead of simplifying it:
# Because of all the requirements I've listed above, we'd need to trick TPE into starting
more threads than we intend to run, while we pause some of them.
# In order to set threads' priorities, we need to write our own ThreadFactory to pass to TPE,
and inside it keep track of the allocated Threads and those that still run, so that we can
control their priority.
# There's no trivial way to impl sync(), as TPE does not provide API we can rely on (e.g.
checking when all threads are done). There are ways to impl that, using Semaphore (see next
bullet).
# In order to block the app on too many merges being scheduled, we'd need to use a Semaphore,
because even if we use a BlockingQueue w/ TPE, the submit() call won't block, but simply reject
the item.
# In order to execute cascading merges as well, we'd need the TPE threads (or the Runnable
we submit) to poll getNextMerge() until null is returned. This breaks the concepts of Executors,
where a task is submitted, done and that's it. I wouldn't mind if we did it still though.

All of these make me think that TPE, at this point at least, is not suitable for CMS. While
it's doable, it's not going to make CMS code any simpler. And it looks more as if we'll enforce
TPE in CMS, than it is really useful.

Refactoring IW, MS and MP seems to not simplify anything, really:
# We'd still need IW telling MP which merges are needed (optimize, regular or expunge), so
the three findMergesForXYZ will need to remain.
# The proposal will add a getNextMerge() to MP, instead of IW, which IMO will only complicate
matters for MP implementers. E.g., what should MP do if findRegularMerges was called, then
getNext() was called and then findOptimizeMerges is called? It's not a critical decision we
leave in the MP developers, but IMO it's unnecessary. Today MP is a stateless object - it
receives SegmentInfos and returns a MergeSpec. It doesn't need to 'remember' anything. But
if we move the getNextMerge() to it, we make it stateful, for no good reasons
# We don't really take IW outside the loop really - it would still need to instruct MP which
merges to 'prepare', so that MS can take.
# We'd need to introduce an abort/cancel() API on MS, which adds another responsibility for
MS, but doesn't remove much from IW.

All in all, I don't think this refactoring simplifies IW-MS-MP communication a lot or allows
custom MS and MPs have more control over what's going on. IW, as the mediator, is nothing
but the mediator, which happens to know (as it should) which merges finish and if the state
of the index changed, ask MP for more merges. That that it keeps track of pending and running
merges, to me, is not a big deal. In fact, due to IW.waitForMerges() it either must continue
to keep track of pending/running merges, or we add a sync() API to every MS.

Fixing CMS to allow sharing across multiple IndexWriter instances is important IMO, so I'll
look into fixing it. To allow for MP dependent sort, I suggest we add to MP a getMergesComparator
and use it in CMS. The default (to not break back-compat) can be ByDocsComparator and we override
it in the existing MPs.

I must say that the more I went over the details, the more I was convinced that the proposals
made will change the current API for no great benefits. But I may have looked too deeply into
the impl, that I lost the ability to think about it 'from above' - so I'd appreciate if someone
can go over what I wrote and offer comments :-).

> Some improvements to CMS
> ------------------------
>
>                 Key: LUCENE-2755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2755
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>
> While running optimize on a large index, I've noticed several things that got me to read
CMS code more carefully, and find these issues:
> * CMS may hold onto a merge if maxMergeCount is hit. That results in the MergeThreads
taking merges from the IndexWriter until they are exhausted, and only then that blocked merge
will run. I think it's unnecessary that that merge will be blocked.
> * CMS sorts merges by segments size, doc-based and not bytes-based. Since the default
MP is LogByteSizeMP, and I hardly believe people care about doc-based size segments anymore,
I think we should switch the default impl. There are two ways to make it extensible, if we
want:
> ** Have an overridable member/method in CMS that you can extend and override - easy.
> ** Have OneMerge be comparable and let the MP determine the order (e.g. by bytes, docs,
calibrate deletes etc.). Better, but will need to tap into several places in the code, so
more risky and complicated.
> On the go, I'd like to add some documentation to CMS - it's not very easy to read and
follow.
> I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message