hadoop-common-issues mailing list archives

From "Xuan Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable
Date Tue, 16 Jun 2015 01:21:01 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587287#comment-14587287 ]

Xuan Gong commented on HADOOP-10584:

[~vinodkv] [~kasha]

bq. From my previous investigation, the patch I posted should help.

It does help. But in the patch, reJoinElection(0) is called, which in turn calls joinElectionInternal:
  private void joinElectionInternal() {
    Preconditions.checkState(appData != null,
        "trying to join election without any app data");
    if (zkClient == null) {
      if (!reEstablishSession()) {
        fatalError("Failed to reEstablish connection with ZooKeeper");
        return;
      }
    }

    createRetryCount = 0;
    wantToBeInElection = true;
    createLockNodeAsync();
  }

Since the ZK quorum is unavailable, we still have the same issue. The difference is that with
the patch we retry for 45s longer (with the default configuration).
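
To illustrate the point, here is a minimal, self-contained sketch. It is not the Hadoop code: BoundedRetry, connectWithRetries, and the Supplier standing in for one ZooKeeper connection attempt are hypothetical names, and the retry count and sleep interval are arbitrary. It only shows that a bounded retry loop delays the fatal path when the quorum never comes back.

```java
import java.util.function.Supplier;

public class BoundedRetry {
    /**
     * Calls attempt up to maxRetries times, sleeping sleepMs between attempts.
     * Returns true as soon as one attempt succeeds, false once all attempts
     * are used up (or the thread is interrupted while sleeping).
     */
    static boolean connectWithRetries(Supplier<Boolean> attempt, int maxRetries, long sleepMs) {
        for (int i = 0; i < maxRetries; i++) {
            if (attempt.get()) {
                return true;
            }
            if (i < maxRetries - 1) {
                try {
                    Thread.sleep(sleepMs);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and give up early.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The quorum stays down: every attempt fails, so the caller still
        // reaches the fatal path, only later than it would without retries.
        boolean connected = connectWithRetries(() -> false, 3, 10);
        System.out.println(connected ? "connected" : "fatal: retries exhausted");
    }
}
```

Raising the retry count or the interval only moves the point at which the fatal path is reached; it does not change the outcome while the quorum is down.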

So if we want to use the retry-then-exit pattern, I think that both the current code and the
current code + the patch are fine. We would also need to adjust the configurations for the cluster.

Or, if we do not expect the RM to exit for this reason (ZK quorum is unavailable), instead
of doing
    public void handle(RMFatalEvent event) {
      LOG.fatal("Received a " + RMFatalEvent.class.getName() + " of type " +
          event.getType().name() + ". Cause:\n" + event.getCause());

      ExitUtil.terminate(1, event.getCause());
    }

We could check the event type, transition the RM to standby, and then rejoin the elector service.
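
A rough, self-contained sketch of that idea follows. Everything in it is hypothetical and chosen for illustration only: the EventType values, the ResourceManagerOps interface, and the Recorder helper are not the real RM classes, and the real handler would work with RMFatalEvent instead.

```java
import java.util.ArrayList;
import java.util.List;

public class FatalEventHandlerSketch {
    /** Hypothetical event types; the real RM event type names may differ. */
    enum EventType { ELECTOR_FAILED, STATE_STORE_FAILED }

    /** Hypothetical view of the RM operations the handler needs. */
    interface ResourceManagerOps {
        void transitionToStandby();
        void rejoinElection();
        void terminate(int status);
    }

    /** Dispatches on the event type instead of always terminating. */
    static void handle(EventType type, ResourceManagerOps rm) {
        if (type == EventType.ELECTOR_FAILED) {
            // ZK-related failure: stay alive as standby and keep trying to be elected.
            rm.transitionToStandby();
            rm.rejoinElection();
        } else {
            // Any other fatal event keeps the existing exit behaviour.
            rm.terminate(1);
        }
    }

    /** Test double that records which operations were invoked, in order. */
    static class Recorder implements ResourceManagerOps {
        final List<String> calls = new ArrayList<>();
        public void transitionToStandby() { calls.add("transitionToStandby"); }
        public void rejoinElection() { calls.add("rejoinElection"); }
        public void terminate(int status) { calls.add("terminate(" + status + ")"); }
    }

    public static void main(String[] args) {
        Recorder rm = new Recorder();
        handle(EventType.ELECTOR_FAILED, rm);
        handle(EventType.STATE_STORE_FAILED, rm);
        System.out.println(rm.calls);
    }
}
```

With this shape, only the elector failure is treated as recoverable; whether other fatal events should also avoid terminating the RM is a separate question.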

> ActiveStandbyElector goes down if ZK quorum become unavailable
> --------------------------------------------------------------
>                 Key: HADOOP-10584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10584
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>         Attachments: hadoop-10584-prelim.patch, rm.log
> ActiveStandbyElector retries operations a few times. If the ZK quorum itself is down,
> it goes down and the daemons will have to be brought up again.
> Instead, it should log the fact that it is unable to talk to ZK, call becomeStandby on
> its client, and continue to attempt connecting to ZK.
