lucene-dev mailing list archives

From "Earwin Burrfoot (JIRA)" <>
Subject [jira] Commented: (LUCENE-2755) Some improvements to CMS
Date Wed, 24 Nov 2010 17:09:14 GMT


Earwin Burrfoot commented on LUCENE-2755:

bq. Refactor IW, MS and MP so that MS pulls merges directly from MP, instead from IW.
Directly or through IW, this is not important. The important point is pulling merges one by one,
when you have the resources to execute them.

bq. Rewrite CMS to take advantage of ThreadPoolExecutor instead of managing the threads on
its own, in addition to using a blocking queue instead of us coding the blocking directly.
bq. Using ThreadPoolExecutor looks like will only complicate CMS instead of simplifying it:
I ended up with the same conclusion while taking my first stabs, but for different reasons.
The philosophy of Executors is that you schedule (push) a number of tasks, and then some magic
black box runs them for you, resolving threading issues itself.
My suggestion requires pulling tasks when computing resources become available, and that doesn't
map onto the scheduling model at all.
All priority/pausing/breaking issues are largely irrelevant.
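
To make the push/pull distinction concrete, here is a minimal sketch of the pull side, assuming a thread-safe source of tasks. PullWorkersSketch, runPulling and the Supplier-based source are hypothetical names for illustration only, not Lucene or CMS API. Each worker asks for exactly one task at the moment it is free to run it, so nothing is queued ahead of time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names exist in Lucene.
final class PullWorkersSketch {
    /**
     * Starts `workers` threads. Each thread asks `source` for ONE task only when
     * it is free to run it, and exits once the source reports nothing left.
     * `source` must be safe to call from several threads.
     * Returns the number of tasks executed.
     */
    static int runPulling(Supplier<Optional<Runnable>> source, int workers) {
        AtomicInteger executed = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                Optional<Runnable> task;
                // Pull, run, repeat: no task is handed out before a thread is idle.
                while ((task = source.get()).isPresent()) {
                    task.get().run();
                    executed.incrementAndGet();
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return executed.get();
    }
}
```

In this sketch a worker simply exits when the source is empty; a real scheduler would also have to decide what to do when new merges appear later.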

bq. MergeThreads' priority needs to be controllable, and we need the ability to pause large
merges in favor of small ones
These, and the like, are not requirements.
They are but one of the possible solutions to our real requirements, which look like:
* don't run out of file handles on fast indexation
* don't degrade search performance and NRT turnaround
* don't kill the disk with too many random IOs.

bq. If there are cascading merges (i.e., a result of several other merges), they should all
be executed following the call to MS.merge() - that is, it could be that CMS itself, or its
MergeThreads will encounter merges not returned by MP at first, but as a subsequent round
due to changes done to the index.
This is trivially solved with my pulling model. We pull until nothing is left. Period. Instead
of getting batches of merges from MP and then reconciling them with reality, we do the same
operation over and over again, until MP is satisfied - very simple.
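
A single-threaded sketch of that loop, under loud assumptions: segments are modelled as plain sizes, ToyMerge stands in for OneMerge, and the "merge two equal-sized segments" rule is a made-up toy policy, not anything LogByteSizeMP does. The point is only that cascading merges fall out of asking again against the current state.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; none of these names exist in Lucene.
final class PullUntilDoneSketch {
    /** Stand-in for OneMerge: the sizes of the two segments to combine. */
    static final class ToyMerge {
        final long a;
        final long b;
        ToyMerge(long a, long b) { this.a = a; this.b = b; }
    }

    /** Toy policy (an assumption): merge the smallest pair of equal-sized segments, if any. */
    static Optional<ToyMerge> nextMerge(List<Long> segments) {
        List<Long> sorted = new ArrayList<>(segments);
        Collections.sort(sorted);
        for (int i = 0; i + 1 < sorted.size(); i++) {
            if (sorted.get(i).equals(sorted.get(i + 1))) {
                return Optional.of(new ToyMerge(sorted.get(i), sorted.get(i + 1)));
            }
        }
        return Optional.empty();
    }

    /** Pulls one merge at a time against the CURRENT segments until the policy has none left. */
    static int mergeAll(List<Long> segments) {
        int executed = 0;
        Optional<ToyMerge> next;
        while ((next = nextMerge(segments)).isPresent()) {
            ToyMerge m = next.get();
            segments.remove(Long.valueOf(m.a)); // drop the two inputs...
            segments.remove(Long.valueOf(m.b));
            segments.add(m.a + m.b);            // ...and add the merged segment
            executed++;
        }
        return executed;
    }
}
```

Starting from segments of sizes 1, 1, 2, 4, the loop runs three merges (1+1, then 2+2, then 4+4) and ends with a single segment of size 8. The second and third merges only become visible after the earlier ones complete, and no batch reconciliation is needed.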

bq. The proposal will add a getNextMerge() to MP, instead of IW, which IMO will only complicate
matters for MP implementers. E.g., what should MP do if findRegularMerges was called, then
getNext() was called and then findOptimizeMerges is called? It's not a critical decision we
leave in the MP developers, but IMO it's unnecessary. Today MP is a stateless object - it
receives SegmentInfos and returns a MergeSpec. It doesn't need to 'remember' anything. But
if we move the getNextMerge() to it, we make it stateful, for no good reasons
bq. We don't really take IW outside the loop really - it would still need to instruct MP which
merges to 'prepare', so that MS can take.
There will most probably be getNext(Normal/Optimize/Expunge)Merge() methods. The findWhatever
methods will be removed; no one needs to call them, so there is no state and no 'preparations'.
MP will receive SegmentInfos and return OneMerge.
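
A rough sketch of what such a stateless pull-style MP could look like. The method names follow the getNext(Normal/Optimize)Merge() idea above, but the signatures, the ToyOneMerge type, the list-of-sizes stand-in for SegmentInfos and the selection rules are all assumptions for illustration, not an actual or proposed Lucene interface.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; none of these names exist in Lucene.
final class StatelessPolicySketch {
    /** Stand-in for OneMerge: the sizes of the segments to merge together. */
    static final class ToyOneMerge {
        final List<Long> sizes;
        ToyOneMerge(List<Long> sizes) {
            this.sizes = Collections.unmodifiableList(new ArrayList<>(sizes));
        }
    }

    /** Each call sees the current segments and returns at most one merge. */
    interface PullingPolicy {
        Optional<ToyOneMerge> getNextNormalMerge(List<Long> segmentSizes);
        Optional<ToyOneMerge> getNextOptimizeMerge(List<Long> segmentSizes, int maxSegments);
    }

    /** Toy implementation with no fields, so it cannot remember earlier calls. */
    static final class SmallestFirstPolicy implements PullingPolicy {
        @Override
        public Optional<ToyOneMerge> getNextNormalMerge(List<Long> segmentSizes) {
            List<Long> sorted = new ArrayList<>(segmentSizes);
            Collections.sort(sorted);
            for (int i = 0; i + 1 < sorted.size(); i++) {
                if (sorted.get(i).equals(sorted.get(i + 1))) {
                    return Optional.of(new ToyOneMerge(sorted.subList(i, i + 2)));
                }
            }
            return Optional.empty();
        }

        @Override
        public Optional<ToyOneMerge> getNextOptimizeMerge(List<Long> segmentSizes, int maxSegments) {
            if (segmentSizes.size() < 2 || segmentSizes.size() <= maxSegments) {
                return Optional.empty(); // already at or below the target
            }
            List<Long> sorted = new ArrayList<>(segmentSizes);
            Collections.sort(sorted);
            return Optional.of(new ToyOneMerge(sorted.subList(0, 2))); // two smallest
        }
    }
}
```

Because the answer depends only on the arguments, interleaving normal and optimize calls cannot confuse the policy: the same segments always yield the same merge.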

bq. To allow for MP dependent sort, I suggest we add to MP a getMergesComparator and use it
in CMS.
MP should return merges sorted, that's all. Why do you need to expose its Comparator or whatever
it uses for sorting?
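
As a small illustration of that point, the sketch below keeps the ordering private to the policy side and lets the scheduler side consume merges in whatever order it is handed. ToyMerge, its fields and both method names are hypothetical, and sorting by bytes is just one possible choice.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Lucene.
final class SortedByPolicySketch {
    /** Stand-in merge descriptor carrying both size measures. */
    static final class ToyMerge {
        final String name;
        final long bytes;
        final int docs;
        ToyMerge(String name, long bytes, int docs) {
            this.name = name;
            this.bytes = bytes;
            this.docs = docs;
        }
    }

    // The ordering is an internal detail of the policy and is never exposed.
    private static final Comparator<ToyMerge> BY_BYTES =
            Comparator.comparingLong((ToyMerge m) -> m.bytes);

    /** Policy side: hand back the candidates already sorted, smallest bytes first. */
    static List<ToyMerge> sortedMerges(List<ToyMerge> candidates) {
        List<ToyMerge> sorted = new ArrayList<>(candidates);
        sorted.sort(BY_BYTES);
        return sorted;
    }

    /** Scheduler side: take merges in the given order; no Comparator is needed here. */
    static List<String> scheduleInOrder(List<ToyMerge> merges) {
        List<String> order = new ArrayList<>();
        for (ToyMerge m : merges) {
            order.add(m.name);
        }
        return order;
    }
}
```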

Whatever I didn't mention from your post - I either missed, or agree with :)
I think I'll stop trying to explain it in Jira comments. It took a great deal of time to discuss
everything with Mike over IRC, and here it'll take ages.
The proper route is to take a handful of dirt and sticks and slap together some working code
to illustrate my point. And that's what I'm gonna do.

> Some improvements to CMS
> ------------------------
>                 Key: LUCENE-2755
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.1, 4.0
> While running optimize on a large index, I've noticed several things that got me to read
CMS code more carefully, and find these issues:
> * CMS may hold onto a merge if maxMergeCount is hit. That results in the MergeThreads
taking merges from the IndexWriter until they are exhausted, and only then that blocked merge
will run. I think it's unnecessary that that merge will be blocked.
> * CMS sorts merges by segments size, doc-based and not bytes-based. Since the default
MP is LogByteSizeMP, and I hardly believe people care about doc-based size segments anymore,
I think we should switch the default impl. There are two ways to make it extensible, if we
> ** Have an overridable member/method in CMS that you can extend and override - easy.
> ** Have OneMerge be comparable and let the MP determine the order (e.g. by bytes, docs,
calibrate deletes etc.). Better, but will need to tap into several places in the code, so
more risky and complicated.
> On the go, I'd like to add some documentation to CMS - it's not very easy to read and
> I'll work on a patch.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
