Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Message-ID: <20726517.93021289860213710.JavaMail.jira@thor>
Date: Mon, 15 Nov 2010 17:30:13 -0500 (EST)
From: "Michael McCandless (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-2755) Some improvements to CMS
In-Reply-To: <14053481.25971289483174131.JavaMail.jira@thor>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932234#action_12932234 ] 

Michael McCandless commented on LUCENE-2755:
--------------------------------------------

{quote}
That has something to do with assigning new segment names, if you believe the comments.
But IW.mergeInit does a freakload of other stuff! I think assigning names can happen in a separate place, before OneMerge is submitted to MS.
{quote}

If indeed that's all then I agree, let's just assign the name up front and then CMS need not call mergeInit.

{quote}
bq. Otherwise, when a laaarge merge is taking place, it causes you to fully stop your indexing threads unnecessarily

I still think this can be mitigated in more appropriate ways, like allocating a big enough pending-merges queue to wait until the long one finishes.
Indexing threads push merges into the queue (with CMS) and don't block.
{quote}

But then you accumulate too many tiny merges, while waiting for the big one to finish?

bq. On top of that, you can use nice policies like BalancedSegmentMergePolicy, which prevent UBER-merges from occurring at all.

Maybe we should move BSMP to core and make it the default?

But I don't fully understand how it chooses merges. E.g., does it pick lopsided merges (where the segments differ substantially in size), as long as they are "small" segments?

{quote}
MergePolicy decides which merges should run NOW, MergeScheduler executes them.
If a certain big merge should run only within some specific timeframe, MergePolicy should not return it when asked for eligible merges.
{quote}

I agree there is ambiguity here, which is not good.  It is tempting to nuke MergeScheduler (absorb CMS into IW, w/ SMS a special case) and define MergePolicy to only return merges which should run right now... that would be a nice simplification.
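
To make the "MergePolicy only returns merges which should run right now" idea concrete, here is a minimal sketch. It does not use Lucene's classes: CandidateMerge, RunNowPolicySketch, the byte threshold and the "inside the window" flag are all hypothetical names invented for illustration. It only shows the division of labor under discussion, where the policy withholds a big merge until its timeframe and whatever executes merges simply runs what it is given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a merge candidate; this is not Lucene's OneMerge.
final class CandidateMerge {
    final String name;
    final long totalBytes;

    CandidateMerge(String name, long totalBytes) {
        this.name = name;
        this.totalBytes = totalBytes;
    }
}

// Hypothetical policy: it decides which merges may run now.
// Executing them is left to whatever component consumes the returned list.
final class RunNowPolicySketch {
    static List<CandidateMerge> eligibleNow(List<CandidateMerge> candidates,
                                            long bigMergeBytes,
                                            boolean insideBigMergeWindow) {
        List<CandidateMerge> eligible = new ArrayList<>();
        for (CandidateMerge m : candidates) {
            boolean isBig = m.totalBytes >= bigMergeBytes;
            // Small merges are always eligible; big ones only inside the window.
            if (!isBig || insideBigMergeWindow) {
                eligible.add(m);
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        List<CandidateMerge> candidates = Arrays.asList(
            new CandidateMerge("small", 10L * 1024 * 1024),
            new CandidateMerge("big", 5L * 1024 * 1024 * 1024));
        long threshold = 1024L * 1024 * 1024; // 1 GB, an arbitrary cutoff
        for (CandidateMerge m : eligibleNow(candidates, threshold, false)) {
            System.out.println("run now: " + m.name);
        }
    }
}
```

With the window flag set to false only the small merge is returned; the big one stays pending until the policy is asked again inside its window.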

> Some improvements to CMS
> ------------------------
>
>                 Key: LUCENE-2755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2755
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>
> While running optimize on a large index, I've noticed several things that got me to read CMS code more carefully, and find these issues:
> * CMS may hold onto a merge if maxMergeCount is hit. That results in the MergeThreads taking merges from the IndexWriter until they are exhausted, and only then will that blocked merge run. I think it's unnecessary for that merge to be blocked.
> * CMS sorts merges by segment size, doc-based and not bytes-based. Since the default MP is LogByteSizeMP, and I hardly believe people care about doc-based segment sizes anymore, I think we should switch the default impl. There are two ways to make it extensible, if we want:
> ** Have an overridable member/method in CMS that you can extend and override - easy.
> ** Have OneMerge be comparable and let the MP determine the order (e.g. by bytes, docs, calibrate deletes etc.). Better, but will need to tap into several places in the code, so more risky and complicated.
> While at it, I'd like to add some documentation to CMS - it's not very easy to read and follow.
> I'll work on a patch.
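
The doc-based versus bytes-based ordering point in the description above can be illustrated with a small sketch. It is not CMS code: PendingMergeInfo, MergeOrderSketch and both comparators are hypothetical, and the ascending direction is an arbitrary choice for the example. The point is only that the two orderings can disagree, and that passing the comparator in is one way to let the caller (for instance the merge policy) choose the order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical description of a pending merge; not Lucene's OneMerge.
final class PendingMergeInfo {
    final String name;
    final int docCount;
    final long sizeInBytes;

    PendingMergeInfo(String name, int docCount, long sizeInBytes) {
        this.name = name;
        this.docCount = docCount;
        this.sizeInBytes = sizeInBytes;
    }
}

final class MergeOrderSketch {
    // Two interchangeable orderings, both ascending (smallest first).
    static final Comparator<PendingMergeInfo> BY_DOC_COUNT =
        Comparator.comparingInt((PendingMergeInfo m) -> m.docCount);
    static final Comparator<PendingMergeInfo> BY_BYTES =
        Comparator.comparingLong((PendingMergeInfo m) -> m.sizeInBytes);

    // Sorts a copy of the input with the supplied ordering and returns the names.
    static List<String> orderedNames(List<PendingMergeInfo> merges,
                                     Comparator<PendingMergeInfo> order) {
        List<PendingMergeInfo> sorted = new ArrayList<>(merges);
        sorted.sort(order);
        List<String> names = new ArrayList<>();
        for (PendingMergeInfo m : sorted) {
            names.add(m.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<PendingMergeInfo> merges = Arrays.asList(
            new PendingMergeInfo("manyTinyDocs", 1_000_000, 50L * 1024 * 1024),
            new PendingMergeInfo("fewHugeDocs", 10_000, 2L * 1024 * 1024 * 1024));
        System.out.println("by docs:  " + orderedNames(merges, BY_DOC_COUNT));
        System.out.println("by bytes: " + orderedNames(merges, BY_BYTES));
    }
}
```

A merge over a million tiny documents counts as the larger one by doc count but the smaller one by bytes, so the two comparators return opposite orders for this input.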
