lucene-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2755) Some improvements to CMS
Date Mon, 15 Nov 2010 15:08:13 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932072#action_12932072 ]

Shai Erera commented on LUCENE-2755:
------------------------------------

The problem with ThreadPoolExecutor is that its submit() doesn't block when the queue is full,
even if you pass a bounded ArrayBlockingQueue; with the default handler the task is rejected
instead (which is really silly IMO). I was hoping we could greatly simplify the CMS logic by
letting a BlockingQueue throttle the number of merges we 'register' before CMS itself waits,
and by using an ExecutorService instead of the MergeThreads and their management.
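
To illustrate the behavior described above, here is a minimal sketch using only JDK classes.
SaturatedPoolDemo and its method name are made up for this example and have nothing to do
with CMS. It fills a pool that has one worker and one queue slot, then submits a third task:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo class, for illustration only. */
class SaturatedPoolDemo {

    /** Returns true if the third task is rejected instead of the caller blocking. */
    static boolean thirdSubmitIsRejected() throws InterruptedException {
        // One worker thread plus one queue slot: room for two tasks in total.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<Runnable>(1));
        CountDownLatch release = new CountDownLatch(1);
        Runnable blocker = () -> {
            try {
                release.await(); // keep the worker busy until the demo is done
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        try {
            pool.submit(blocker); // occupies the single worker
            pool.submit(blocker); // occupies the single queue slot
            try {
                pool.submit(blocker); // no room left: the default AbortPolicy throws
                return false;
            } catch (RejectedExecutionException expected) {
                return true;
            }
        } finally {
            release.countDown();
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }
}
```

The bounded queue limits how much work is accepted, but it does so by rejecting the task, not
by making the submitting thread wait.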

Unfortunately this does not look to be the case. Here is an alternative solution which looks
like a nice workaround:
http://stackoverflow.com/questions/2001086/how-to-make-threadpoolexecutors-submit-method-block-if-it-is-saturated

The idea is to block the call to ExecutorService.execute() through a Semaphore. In that case,
I think it's safe to not use a blocking queue at all, because the throttling will be handled
by the Semaphore.
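
A rough sketch of the Semaphore idea, assuming a plain java.util.concurrent.Executor
underneath. BoundedExecutor and maxInFlight are hypothetical names for this example; this is
not code from CMS or from any patch on this issue:

```java
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;

/** Hypothetical wrapper, not part of Lucene or the JDK. */
class BoundedExecutor {
    private final Executor delegate;
    private final Semaphore permits;

    BoundedExecutor(Executor delegate, int maxInFlight) {
        this.delegate = delegate;
        this.permits = new Semaphore(maxInFlight);
    }

    /** Blocks the caller until a permit is free, then hands the task to the delegate. */
    void execute(final Runnable task) throws InterruptedException {
        permits.acquire(); // the throttling happens here, not in the queue
        try {
            delegate.execute(() -> {
                try {
                    task.run();
                } finally {
                    permits.release(); // free the slot when the task ends
                }
            });
        } catch (RejectedExecutionException e) {
            permits.release(); // the task never started, so give the permit back
            throw e;
        }
    }

    /** Number of permits currently free; exposed only for inspection. */
    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Since at most maxInFlight tasks are accepted at any time, the delegate's queue can never grow
beyond that number, which is why the queue itself does not need to do the throttling.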

Another alternative is to use a CallerRunsPolicy as the rejection policy, which has many disadvantages
(such as potentially starving the other threads if the caller ends up executing a heavy task,
or risking tasks being run by N+1 threads, etc.).
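
For reference, a small sketch of what CallerRunsPolicy does when the pool is saturated.
CallerRunsDemo is a made-up class for this example; only the JDK calls are real:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical demo class, for illustration only. */
class CallerRunsDemo {

    /** Returns the name of the thread that ran the task which did not fit in the pool. */
    static String threadThatRanOverflowTask() throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(1),
                new ThreadPoolExecutor.CallerRunsPolicy());
        CountDownLatch release = new CountDownLatch(1);
        Runnable blocker = () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        AtomicReference<String> ranOn = new AtomicReference<>();
        try {
            pool.execute(blocker); // occupies the single worker
            pool.execute(blocker); // occupies the single queue slot
            // Saturated: the policy runs this task synchronously in the calling thread.
            pool.execute(() -> ranOn.set(Thread.currentThread().getName()));
        } finally {
            release.countDown();
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
        return ranOn.get();
    }
}
```

The overflow task runs on the submitting thread, so with N pool threads you can end up with
N+1 threads doing work, and the submitter is stuck for as long as that task takes.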

Earwin - if we make OneMerge comparable, we give any MP the freedom to decide the order in which
merges will run. In my case this is important because I get a fixed time frame to run index
optimization and prefer to eliminate as many segments as possible, so I choose to run
the smaller merges first. I think it's a reasonable default anyway, because
even if you call close(false) (not waiting for merges), it's better if some merges have
already finished and committed, so that you're making forward progress all the time. If
you run merges in arbitrary order, you might not finish any merge.

I agree though that in some situations apps won't care, in which case sorting by merge size
will be as good as random ordering.
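
To make the "comparable merges" idea concrete, here is a sketch with a stand-in class.
PendingMerge, totalBytes and MergeOrderDemo are hypothetical and are NOT Lucene's
MergePolicy.OneMerge or any existing API; the ordering by bytes is just the case I described
above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a merge; not Lucene's MergePolicy.OneMerge. */
class PendingMerge implements Comparable<PendingMerge> {
    final String name;
    final long totalBytes;

    PendingMerge(String name, long totalBytes) {
        this.name = name;
        this.totalBytes = totalBytes;
    }

    /** Smaller merges sort first; a different policy could compare by docs instead. */
    @Override
    public int compareTo(PendingMerge other) {
        return Long.compare(totalBytes, other.totalBytes);
    }
}

/** Hypothetical demo class, for illustration only. */
class MergeOrderDemo {

    /** Returns the merge names in the order a scheduler would pick them up. */
    static List<String> runOrder(List<PendingMerge> merges) {
        PriorityQueue<PendingMerge> queue = new PriorityQueue<>(merges);
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll().name); // always the smallest remaining merge
        }
        return order;
    }
}
```

The scheduler only needs the natural ordering; which measure defines "smaller" stays entirely
inside compareTo(), i.e. with whoever creates the merges.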

Pausing large merges is something that I consider less important, but I don't want
to break back-compat behavior. IMO if a merge started - let it finish. You don't know how
much work it has completed, how much work is left, or how much work the 'smaller' merge
has (what if, say, it's smaller by 1 byte/doc?). In different situations the best decision might
be different, therefore IMO we shouldn't pause threads - rather let the MP decide up front
the order of the merges (if it wants to) and then execute them in that order.

> Some improvements to CMS
> ------------------------
>
>                 Key: LUCENE-2755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2755
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>
> While running optimize on a large index, I've noticed several things that got me to read
> CMS code more carefully, and find these issues:
> * CMS may hold onto a merge if maxMergeCount is hit. That results in the MergeThreads
> taking merges from the IndexWriter until they are exhausted, and only then that blocked merge
> will run. I think it's unnecessary that that merge will be blocked.
> * CMS sorts merges by segments size, doc-based and not bytes-based. Since the default
> MP is LogByteSizeMP, and I hardly believe people care about doc-based size segments anymore,
> I think we should switch the default impl. There are two ways to make it extensible, if we
> want:
> ** Have an overridable member/method in CMS that you can extend and override - easy.
> ** Have OneMerge be comparable and let the MP determine the order (e.g. by bytes, docs,
> calibrate deletes etc.). Better, but will need to tap into several places in the code, so
> more risky and complicated.
> On the go, I'd like to add some documentation to CMS - it's not very easy to read and
> follow.
> I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

