hadoop-common-issues mailing list archives

From "Daniel Templeton (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable
Date Fri, 24 Feb 2017 01:09:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881687#comment-15881687 ]

Daniel Templeton commented on HADOOP-10584:
-------------------------------------------

Resetting the counts isn't the answer.  I can now reproduce this issue reliably by setting
a breakpoint in {{processWatchEvent()}} and shutting down ZK before continuing.  The issue
is a race between the events from the ZK client and the results of creating/statting the
ZK node.  If the disconnected event arrives first, all is well.  If not, the elector retries
a few times and then fails the RM.

To echo earlier comments, why does ZK connection loss necessitate stopping the RM in this
case?  It doesn't in any other case.  My proposal would be to remove the fatal error completely.
We could instead either transition to standby explicitly or just ignore the error (and hence
skip the retries) on connection loss and wait for the ZK event to trigger the transition.
I kinda like the latter.  Any opinions?
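
To make the two orderings concrete, here is a minimal toy model of the race and of the
"ignore and wait" option.  It is only a sketch, not the ActiveStandbyElector code: the class
{{ElectorSketch}}, its methods, the {{ignoreConnectionLoss}} flag, and the retry limit of 3
are all hypothetical names and values chosen for illustration, and no ZooKeeper API is used.

```java
// Toy model only: every name here is hypothetical, not the real elector API.
class ElectorSketch {
    enum State { ACTIVE, STANDBY, FATAL }
    enum Result { OK, CONNECTION_LOSS }

    static final int MAX_RETRIES = 3; // assumed limit, for illustration

    private final boolean ignoreConnectionLoss;
    private State state = State.ACTIVE;
    private int retries = 0;

    ElectorSketch(boolean ignoreConnectionLoss) {
        this.ignoreConnectionLoss = ignoreConnectionLoss;
    }

    // Stands in for the client's "disconnected" watch event.
    void onDisconnectedEvent() {
        if (state != State.FATAL) {
            state = State.STANDBY;
            retries = 0;
        }
    }

    // Stands in for the result callback of a create/stat on the lock node.
    void onOperationResult(Result rc) {
        if (rc == Result.OK) {
            retries = 0;
            return;
        }
        // Connection loss: under the proposed policy, or once we have already
        // left ACTIVE, do nothing and let the watch event drive the transition.
        if (ignoreConnectionLoss || state != State.ACTIVE) {
            return;
        }
        // Retry-then-fail policy: count the failure, go fatal past the limit.
        if (++retries > MAX_RETRIES) {
            state = State.FATAL;
        }
    }

    State state() {
        return state;
    }

    public static void main(String[] args) {
        // Losing order: the result callbacks arrive before the disconnected event.
        ElectorSketch retryThenFail = new ElectorSketch(false);
        ElectorSketch ignoreAndWait = new ElectorSketch(true);
        for (int i = 0; i <= MAX_RETRIES; i++) {
            retryThenFail.onOperationResult(Result.CONNECTION_LOSS);
            ignoreAndWait.onOperationResult(Result.CONNECTION_LOSS);
        }
        retryThenFail.onDisconnectedEvent();
        ignoreAndWait.onDisconnectedEvent();
        System.out.println("retry-then-fail: " + retryThenFail.state());
        System.out.println("ignore-and-wait: " + ignoreAndWait.state());
    }
}
```

In this model the retry-then-fail instance ends up FATAL when the callbacks win the race,
while the ignore-and-wait instance ends up in STANDBY regardless of ordering.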

> ActiveStandbyElector goes down if ZK quorum become unavailable
> --------------------------------------------------------------
>
>                 Key: HADOOP-10584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10584
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Priority: Critical
>         Attachments: hadoop-10584-prelim.patch, rm.log
>
>
> ActiveStandbyElector retries operations a few times. If the ZK quorum itself is down,
> it goes down and the daemons have to be brought up again.
> Instead, it should log the fact that it is unable to talk to ZK, call becomeStandby on
> its client, and continue to attempt connecting to ZK.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

