hadoop-common-issues mailing list archives

From "Rakesh R (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable
Date Thu, 18 Jun 2015 06:59:01 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591365#comment-14591365 ]

Rakesh R commented on HADOOP-10584:
-----------------------------------

Sorry for pitching in late. After looking at the logic, I also feel this case can occur in
production clusters. On ZooKeeper connection loss, ActiveStandbyElector performs a certain number
of retries and finally calls {{ActiveStandbyElectorCallback#notifyFatalError()}}. I can
see that the {{EmbeddedElectorService#notifyFatalError}} implementation handles this case by
immediately terminating the service. I think we have room to improve this logic instead of
terminating immediately.

About the proposed patch: IIUC, it is not required to add extra handling of ZooKeeper
exceptions and re-election in the ActiveStandbyElector class. Presently we have the {{ActiveStandbyElector#processWatchEvent}}
logic to handle ZK connection state changes. On a connection state change, the ZooKeeper client
notifies the registered ZK watcher with a state such as SyncConnected, Disconnected, Expired, etc.
Based on that state, {{ActiveStandbyElector}} notifies the registered {{ActiveStandbyElectorCallback}}
and performs the state transitions. Please see [ActiveStandbyElector.java#L550|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ActiveStandbyElector.java#L550]
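
To make the state handling described above concrete, here is a rough, self-contained sketch of
dispatching on the connection state instead of failing fatally. It is not the actual
{{processWatchEvent}} code: {{WatchEventSketch}}, {{ConnState}}, {{Callback}} and its method names
are hypothetical stand-ins, and only the three states named above are modelled.

```java
import java.util.ArrayList;
import java.util.List;

class WatchEventSketch {
    // Hypothetical stand-in for the ZooKeeper connection states named above.
    enum ConnState { SYNC_CONNECTED, DISCONNECTED, EXPIRED }

    // Hypothetical, trimmed-down callback; not the real ActiveStandbyElectorCallback.
    interface Callback {
        void resumeMonitoring();
        void enterNeutralMode();
        void rejoinElection();
    }

    // Maps a connection state change to callback actions rather than a fatal error.
    static void onConnectionEvent(ConnState state, Callback cb) {
        switch (state) {
            case SYNC_CONNECTED:
                // Connection is back: carry on watching the election state.
                cb.resumeMonitoring();
                break;
            case DISCONNECTED:
                // Possibly transient: we no longer know who is active.
                cb.enterNeutralMode();
                break;
            case EXPIRED:
                // Session is gone: go neutral, then start a fresh election attempt.
                cb.enterNeutralMode();
                cb.rejoinElection();
                break;
        }
    }

    // Records the actions taken so the dispatch can be inspected.
    static class RecordingCallback implements Callback {
        final List<String> actions = new ArrayList<>();
        public void resumeMonitoring() { actions.add("resumeMonitoring"); }
        public void enterNeutralMode() { actions.add("enterNeutralMode"); }
        public void rejoinElection() { actions.add("rejoinElection"); }
    }
}
```

The point of the sketch is only that none of these branches needs to end in {{notifyFatalError()}};
what each branch should really do is up for discussion.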

What I meant is that the ZooKeeper client stays alive and internally keeps retrying the connection
re-establishment indefinitely. IMHO, we could think of implementing {{EmbeddedElectorService#enterNeutralMode}}
to handle the NEUTRAL transition of the RM. Also, {{ActiveStandbyElectorCallback#notifyFatalError()}}
has to be refined. Any thoughts?

{code}
  public void enterNeutralMode() {
    /**
     * Possibly due to transient connection issues. Do nothing.
     * TODO: Might want to keep track of how long in this state and transition
     * to standby.
     */
  }
{code}
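
One possible way to address the TODO in that method is to remember when the neutral state was
entered and move to standby once it has lasted too long. The following is only a sketch under
assumptions: {{NeutralModeTracker}}, {{onReconnected}}, {{shouldTransitionToStandby}} and the
timeout parameter are hypothetical names, and nothing here is existing Hadoop API.

```java
import java.util.function.LongSupplier;

// Hypothetical helper: tracks how long the elector has been in neutral mode.
class NeutralModeTracker {
    private final LongSupplier clockMillis;   // injectable clock, eases testing
    private final long maxNeutralMillis;      // assumed configurable timeout
    private long neutralSinceMillis = -1;     // -1 means "not in neutral mode"

    NeutralModeTracker(LongSupplier clockMillis, long maxNeutralMillis) {
        this.clockMillis = clockMillis;
        this.maxNeutralMillis = maxNeutralMillis;
    }

    // Called when the ZK connection is lost; keeps the first timestamp on repeats.
    synchronized void enterNeutralMode() {
        if (neutralSinceMillis < 0) {
            neutralSinceMillis = clockMillis.getAsLong();
        }
    }

    // Called when the ZK connection is re-established.
    synchronized void onReconnected() {
        neutralSinceMillis = -1;
    }

    // True once the neutral state has lasted at least the configured timeout.
    synchronized boolean shouldTransitionToStandby() {
        return neutralSinceMillis >= 0
            && clockMillis.getAsLong() - neutralSinceMillis >= maxNeutralMillis;
    }
}
```

With something like this, {{enterNeutralMode()}} would start the timer and a periodic check would
trigger the transition to standby, for example with
{{new NeutralModeTracker(System::currentTimeMillis, 10000L)}}.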

> ActiveStandbyElector goes down if ZK quorum become unavailable
> --------------------------------------------------------------
>
>                 Key: HADOOP-10584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10584
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>         Attachments: hadoop-10584-prelim.patch, rm.log
>
>
> ActiveStandbyElector retries operations for a few times. If the ZK quorum itself is down,
> it goes down and the daemons will have to be brought up again.
> Instead, it should log the fact that it is unable to talk to ZK, call becomeStandby on
> its client, and continue to attempt connecting to ZK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
