lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shai Erera (JIRA)" <>
Subject [jira] Commented: (LUCENE-2755) Some improvements to CMS
Date Tue, 16 Nov 2010 07:01:13 GMT


Shai Erera commented on LUCENE-2755:

I have to say I totally agree w/ Earwin - in my mind, the MP should be the one deciding what
merges to run, and in what order, the MS should be the one executing them. Hack, we should
really call MS a MergeExecutor, since it doesn't really schedule anything, just does what
it's told, based on the execution policy (parallel or blocking).

In my app, it's the MP which decides which merges to run, based on their sizes, time of the
day and allotted time to run. I expect MS to faithfully do what it's told and don't play tricks
on me (like pausing merges I've asked it to run), 'cause otherwise I'll need to write a MS

Another question that was brought up here is who should register the merges. Today there are
two entities - one is the MS which repeatedly calls IW.getNextMerge() and in the CMS case,
MergeThread does so too. The disadvantage of that is that IW.getNextMerge() (and mergeInit())
is called from two places, but the advantage is that it allows executing all of MP merges,
even if they come in several rounds. Example, you have 4 segs of level 0 and 3 segs of level
1 (mergeFactor=4). MP returns a single merge (4 segs level 0), and IW.getNextMerge() returns
null 'cause there are no merges left). However, after completing this merge, if asked, MP
will return a follow on merge of 4 segs level 1. That will be picked up by the MergeThread
that calls IW.getNextMerge().

The thing is - the second round of merges may only be detected by MP after most if not all
of the merges from the previous round ended. Therefore, if we dim that functionality important
(and I do), we must have the MergeThreads query MP as well.

About mergeInit, I took a look too and discovered a 140 lines method, so I doubt it does only
"new segment name registration". But perhaps I'm wrong and those are redundant 140 lines ...

I think that having a MS entity (or MergeExecutor) is important - it still gives an app the
ability to override things if it wants to. Also, putting the MS code inside IW will only add
code to it, and I think that we should really start refactoring it down to smaller, more readable
and focused, pieces. So I'm against adding more logic to it. For 3x we can choose to improve
things internally, or leave them as they are. For 4.0 I'd suggest we do the following:
* Create a new MergeExecutor entity receives an Executor(Service?) to run with, but also defaults
to one (like CMS is today). That replaces CMS.
** You can control the number of threads it runs with (1 for almost-SMS-like behavior and
more for CMS).
* We create a BlockingMergeExecutor, which regardless of how many threads you allow it to
run with, blocks until all merges finish.
** That is an improved SMS - I've always thought that blocking until merges finish is not
related to how many threads you'd like to execute merges with. E.g., if you set CMS's # threads
to 1, you get a sort of an SMS behavior, only the call is non-blocking.
* We define MP as the one responsible for returning the merges in the order it wants, and
provide the necessary support by making OneMerge comparable (Earwin, MP can still sort by
using a custom Comparator, we only provide a default comparison method for merges). That definition
will be mostly in javadocs.

> Some improvements to CMS
> ------------------------
>                 Key: LUCENE-2755
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.1, 4.0
> While running optimize on a large index, I've noticed several things that got me to read
CMS code more carefully, and find these issues:
> * CMS may hold onto a merge if maxMergeCount is hit. That results in the MergeThreads
taking merges from the IndexWriter until they are exhausted, and only then that blocked merge
will run. I think it's unnecessary that that merge will be blocked.
> * CMS sorts merges by segments size, doc-based and not bytes-based. Since the default
MP is LogByteSizeMP, and I hardly believe people care about doc-based size segments anymore,
I think we should switch the default impl. There are two ways to make it extensible, if we
> ** Have an overridable member/method in CMS that you can extend and override - easy.
> ** Have OneMerge be comparable and let the MP determine the order (e.g. by bytes, docs,
calibrate deletes etc.). Better, but will need to tap into several places in the code, so
more risky and complicated.
> On the go, I'd like to add some documentation to CMS - it's not very easy to read and
> I'll work on a patch.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message