lucene-dev mailing list archives

From "Earwin Burrfoot (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2755) Some improvements to CMS
Date Mon, 15 Nov 2010 23:55:13 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932266#action_12932266 ]

Earwin Burrfoot commented on LUCENE-2755:
-----------------------------------------

bq. But then you accumulate too many tiny merges, while waiting for the big one to finish?
You say that as if it were something terribly wrong. :)
Big merges aren't heffalumps; they don't usually stalk IW in droves. A big merge ends sooner
or later, and the tiny ones go out in a flash.

bq. Maybe we should move BSMP to core and make it the default?
Dunno. The index you end up with is larger than with LogWhateverMP.
But you get a nice benefit of having roughly equal-sized big segments, which is cool for running
collection in parallel.
Everyone has his own requirements.

bq. But I don't fully understand how it chooses merges. EG does it pick lopsided merges (where
the segments differ substantially in size), as long as they are "small" segments?
The docs say small segments are treated the same way as with LogByteSizeMP.



Another thought I had while looking through the code: we have a seriously inefficient
"merge conflict" resolution algorithm on our hands.
We simply drop every new merge that has segments in common with a merge that is already queued
(but not yet running!!).
What does that mean?

Imagine we're producing a slew of mini-segments at a decent rate and our MergeScheduler is
lagging behind:
* new seg1
* new seg2
* queue merge seg1+seg2
* start merge seg1+seg2
* new seg3
* new seg4
* queue merge seg3+seg4
* new seg5
* FAIL queue merge seg3+seg4+seg5
* new seg6
* FAIL queue merge seg3+seg4+seg5+seg6
* finish merge seg1+seg2
* start merge seg3+seg4

By that point we should really start merging all of the last four segments (maybe together with
the result of seg1+seg2).
But in reality we'll merge seg3+seg4, then seg5+seg6, and then all three merge results together
(provided no new mini-segments are added).
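
To make the timeline above concrete, here is a minimal Java sketch of the drop-on-conflict rule. It is a toy model, not Lucene code: the class DropOnConflictQueue and its methods are hypothetical names, segments are plain strings, and it assumes a proposed merge is rejected whenever it shares a segment with a pending or running merge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model of the conflict rule described above. Hypothetical names; not Lucene code.
class DropOnConflictQueue {
    private final List<Set<String>> pending = new ArrayList<>();
    private final List<Set<String>> running = new ArrayList<>();

    /** Registers a merge unless it shares a segment with a pending or running one. */
    boolean offer(String... segments) {
        Set<String> candidate = new LinkedHashSet<>(Arrays.asList(segments));
        for (Set<String> m : pending) {
            if (!Collections.disjoint(m, candidate)) return false; // dropped
        }
        for (Set<String> m : running) {
            if (!Collections.disjoint(m, candidate)) return false; // dropped
        }
        pending.add(candidate);
        return true;
    }

    /** Moves the oldest pending merge to the running set; null if nothing is pending. */
    Set<String> startNext() {
        if (pending.isEmpty()) return null;
        Set<String> merge = pending.remove(0);
        running.add(merge);
        return merge;
    }

    /** Marks a running merge as done. */
    void finish(Set<String> merge) {
        running.remove(merge);
    }

    public static void main(String[] args) {
        DropOnConflictQueue q = new DropOnConflictQueue();
        q.offer("seg1", "seg2");
        Set<String> first = q.startNext();                           // seg1+seg2 running
        System.out.println(q.offer("seg3", "seg4"));                 // true
        System.out.println(q.offer("seg3", "seg4", "seg5"));         // false
        System.out.println(q.offer("seg3", "seg4", "seg5", "seg6")); // false
        q.finish(first);
        System.out.println(q.startNext());                           // [seg3, seg4]
    }
}
```

The two rejected offers are the "FAIL queue merge" steps: the stale seg3+seg4 merge keeps its place and is the one that eventually starts.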

If we throw large merges into the mix (whether pausable or not), the problem is amplified.

Ugly solution: when MP suggests a merge that is a strict superset of a queued but not yet
running merge, drop the old one and use the new one.
Better solution: instead of asking MP for all the merges it deems reasonable on the current index,
we ask it only for the "most important" one,
and we do so each time MS has an open slot for execution. This way each merge that runs is
the best merge possible at that moment.
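
Both ideas can be sketched in a few lines of Java. This is only an illustration under assumptions, not a patch: MergeSelectionSketch, offer, pickWhenSlotOpens and mergeAllFree are hypothetical names, segments are plain strings, and the "policy" is a toy that merges every free segment when there are at least two.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical model of the two proposals above; none of these names exist in Lucene.
class MergeSelectionSketch {

    static Set<String> setOf(String... segments) {
        return new LinkedHashSet<>(Arrays.asList(segments));
    }

    /** "Ugly" variant: a strict superset of pending merges replaces them. */
    static boolean offer(List<Set<String>> pending, List<Set<String>> running,
                         Set<String> candidate) {
        for (Set<String> r : running) {
            if (!Collections.disjoint(r, candidate)) return false; // never touch running merges
        }
        List<Set<String>> replaced = new ArrayList<>();
        for (Set<String> p : pending) {
            if (Collections.disjoint(p, candidate)) continue;
            boolean strictSuperset = candidate.containsAll(p) && candidate.size() > p.size();
            if (!strictSuperset) return false; // partial overlap or duplicate: still dropped
            replaced.add(p);
        }
        pending.removeAll(replaced);
        pending.add(candidate);
        return true;
    }

    /** "Better" variant: no queue; consult the policy only when a slot is free. */
    static Set<String> pickWhenSlotOpens(List<String> liveSegments, Set<String> beingMerged,
                                         Function<List<String>, Set<String>> policy) {
        List<String> free = new ArrayList<>(liveSegments);
        free.removeAll(beingMerged);
        return policy.apply(free);
    }

    /** Toy policy (an assumption): merge every free segment if there are at least two. */
    static Set<String> mergeAllFree(List<String> free) {
        Set<String> merge = new LinkedHashSet<>();
        if (free.size() >= 2) merge.addAll(free);
        return merge;
    }

    public static void main(String[] args) {
        List<Set<String>> pending = new ArrayList<>();
        List<Set<String>> running = new ArrayList<>();
        running.add(setOf("seg1", "seg2"));
        offer(pending, running, setOf("seg3", "seg4"));
        offer(pending, running, setOf("seg3", "seg4", "seg5"));
        offer(pending, running, setOf("seg3", "seg4", "seg5", "seg6"));
        System.out.println(pending); // [[seg3, seg4, seg5, seg6]]

        List<String> live = Arrays.asList("seg1", "seg2", "seg3", "seg4", "seg5", "seg6");
        System.out.println(pickWhenSlotOpens(live, setOf("seg1", "seg2"),
                MergeSelectionSketch::mergeAllFree)); // [seg3, seg4, seg5, seg6]
    }
}
```

In the pull-based variant there is no stale queue to reconcile: the policy only ever sees the segments that are free at the moment a slot opens.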

Please correct me if I'm wrong anywhere.

> Some improvements to CMS
> ------------------------
>
>                 Key: LUCENE-2755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2755
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>
> While running optimize on a large index, I've noticed several things that got me to read
> CMS code more carefully, and find these issues:
> * CMS may hold onto a merge if maxMergeCount is hit. That results in the MergeThreads
> taking merges from the IndexWriter until they are exhausted, and only then that blocked merge
> will run. I think it's unnecessary that that merge will be blocked.
> * CMS sorts merges by segments size, doc-based and not bytes-based. Since the default
> MP is LogByteSizeMP, and I hardly believe people care about doc-based size segments anymore,
> I think we should switch the default impl. There are two ways to make it extensible, if we
> want:
> ** Have an overridable member/method in CMS that you can extend and override - easy.
> ** Have OneMerge be comparable and let the MP determine the order (e.g. by bytes, docs,
> calibrate deletes etc.). Better, but will need to tap into several places in the code, so
> more risky and complicated.
> On the go, I'd like to add some documentation to CMS - it's not very easy to read and
> follow.
> I'll work on a patch.
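
As an illustration of why the doc-based versus bytes-based ordering in the quoted description matters, here is a small Java sketch. It is hypothetical throughout: PendingMerge is a stand-in with made-up numbers, not Lucene's OneMerge, and ascending order (smallest first) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in types; not Lucene's OneMerge or CMS.
class MergeOrderSketch {
    static final class PendingMerge {
        final String name;
        final int docCount;
        final long sizeInBytes;

        PendingMerge(String name, int docCount, long sizeInBytes) {
            this.name = name;
            this.docCount = docCount;
            this.sizeInBytes = sizeInBytes;
        }
    }

    // Two pluggable orderings, smallest first (the direction is an assumption).
    static final Comparator<PendingMerge> BY_DOCS = Comparator.comparingInt(m -> m.docCount);
    static final Comparator<PendingMerge> BY_BYTES = Comparator.comparingLong(m -> m.sizeInBytes);

    /** Returns the merge names in the order the given comparator would schedule them. */
    static List<String> order(List<PendingMerge> merges, Comparator<PendingMerge> cmp) {
        List<PendingMerge> sorted = new ArrayList<>(merges);
        sorted.sort(cmp);
        List<String> names = new ArrayList<>();
        for (PendingMerge m : sorted) names.add(m.name);
        return names;
    }

    public static void main(String[] args) {
        List<PendingMerge> merges = Arrays.asList(
                new PendingMerge("manySmallDocs", 1_000_000, 50L * 1024 * 1024),
                new PendingMerge("fewLargeDocs", 10_000, 2L * 1024 * 1024 * 1024));
        System.out.println(order(merges, BY_DOCS));  // [fewLargeDocs, manySmallDocs]
        System.out.println(order(merges, BY_BYTES)); // [manySmallDocs, fewLargeDocs]
    }
}
```

Passing the comparator in roughly mirrors the second option in the description, where whoever knows the size metric decides the order.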

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

