lucene-dev mailing list archives

From "Commit Tag Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-4099) Suspect zookeeper client thread doesn't call back the watcher, that occur the overseer collection can't work normal.
Date Wed, 21 Nov 2012 15:31:59 GMT

    [ https://issues.apache.org/jira/browse/SOLR-4099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502043#comment-13502043 ]

Commit Tag Bot commented on SOLR-4099:
--------------------------------------

[branch_4x commit] Mark Robert Miller
http://svn.apache.org/viewvc?view=revision&revision=1412142

SOLR-4099: Allow the collection api work queue to make forward progress even when its watcher
is not fired for some reason.


                
> Suspect zookeeper client thread doesn't call back the watcher, that occur the overseer
collection can't work normal.
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-4099
>                 URL: https://issues.apache.org/jira/browse/SOLR-4099
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.0-ALPHA, 4.0-BETA, 4.0
>         Environment: Zookeeper version: 3.2
>            Reporter: Raintung Li
>            Assignee: Mark Miller
>             Fix For: 4.1, 5.0
>
>         Attachments: patch-4099.txt
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> In our test environment, the ZooKeeper version is older than the required version; we are not
using Solr's default 3.3.6 version.
> The overseer collection processor stopped working. A thread dump shows the overseer waiting in LatchChildWatcher.await.

> Checking /overseer/collection-queue-work in ZooKeeper shows that many collection operations are blocked there.

> After checking the logic, I suspect the ZooKeeper client did not call back the watcher registered
on the path "/overseer/collection-queue-work". Unfortunately, the relevant logging is at debug level.
> This happens very rarely, but when it does it is fatal: we have to
stop the leader server.
> As a compensating measure, I suggest not waiting indefinitely for the notification. Instead, wait
for a bounded time (ten minutes, half an hour, or some other value) and then recheck the queue.
Of course, if the notification does arrive, processing continues immediately as normal.
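
The bounded wait suggested above can be sketched roughly as follows. This is only an illustration in plain Java, not the code committed for SOLR-4099 and not the attached patch: the class names `BoundedWaitQueueSketch` and `TimedLatch`, the `take` method, and the `recheckMs` parameter are all hypothetical, and an in-memory queue stands in for the ZooKeeper node /overseer/collection-queue-work.

```java
import java.util.Queue;

/** Hypothetical sketch of a queue consumer that does not rely solely on a watcher callback. */
final class BoundedWaitQueueSketch {

    /** Latch that a watcher callback would release (illustrative stand-in for LatchChildWatcher). */
    static final class TimedLatch {
        private final Object lock = new Object();
        private boolean fired = false;

        /** Would be invoked from the watcher callback when the queue node changes. */
        void fire() {
            synchronized (lock) {
                fired = true;
                lock.notifyAll();
            }
        }

        /** Waits until fired or until timeoutMs elapses; returns true only if fired. */
        boolean await(long timeoutMs) throws InterruptedException {
            long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
            synchronized (lock) {
                while (!fired) {
                    long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                    if (remainingMs <= 0) {
                        break; // timed out: never call wait(0), which would block forever
                    }
                    lock.wait(remainingMs);
                }
                boolean result = fired;
                fired = false; // reset so the latch can be reused for the next wait
                return result;
            }
        }
    }

    /** Returns the next queued item, rechecking the queue after every bounded wait. */
    static String take(Queue<String> queue, TimedLatch latch, long recheckMs)
            throws InterruptedException {
        while (true) {
            String head = queue.poll();
            if (head != null) {
                return head;
            }
            // The result is ignored on purpose: whether the latch fired or the
            // wait timed out, the loop rechecks the queue, so a lost
            // notification only delays processing by at most recheckMs.
            latch.await(recheckMs);
        }
    }
}
```

With this shape, a missed callback costs at most one recheck interval instead of blocking the consumer until the leader is restarted, while a delivered notification still wakes the consumer immediately.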

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

