hadoop-yarn-issues mailing list archives

From "Tsuyoshi OZAWA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
Date Mon, 05 May 2014 22:09:18 GMT

    [ https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990037#comment-13990037
] 

Tsuyoshi OZAWA commented on YARN-2019:
--------------------------------------

RMStateStore handles the exceptions in ZKRMStateStore like this: 
{code}
    try {
      // ZK related operations
      removeRMDTMasterKeyState(delegationKey);
    } catch (Exception e) {
      notifyStoreOperationFailed(e);
    }
{code}

If the store is fenced, RMFatalEventDispatcher handles the event and the RM transitions to
standby. However, if STATE_STORE_OP_FAILED occurs, the active RM terminates. After failover to
the standby RM, the same exception could recur on the new active RM. Maybe this is the case
[~djp] mentioned. Please correct me if I'm wrong.

{code}
  @Private
  public static class RMFatalEventDispatcher
      implements EventHandler<RMFatalEvent> {
    @Override
    public void handle(RMFatalEvent event) {
      LOG.fatal("Received a " + RMFatalEvent.class.getName() + " of type " +
          event.getType().name() + ". Cause:\n" + event.getCause());

      if (event.getType() == RMFatalEventType.STATE_STORE_FENCED) {
        LOG.info("RMStateStore has been fenced");
        if (rmContext.isHAEnabled()) {
          try {
            // Transition to standby and reinit active services
            LOG.info("Transitioning RM to Standby mode");
            rm.transitionToStandby(true);
            return;
          } catch (Exception e) {
            LOG.fatal("Failed to transition RM to Standby mode.");
          }
        }
      }

      ExitUtil.terminate(1, event.getCause());
    }
  }
{code}
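
To make the concern concrete, here is a minimal, self-contained sketch of the decision the dispatcher makes. It is not the Hadoop code: FatalType, Outcome, FailoverSketch, dispatch and survivingRms are hypothetical names used only for illustration. It assumes the store failure is persistent, so every RM that becomes active hits it again, and it ignores the case where the transition to standby itself throws.

```java
/** Hypothetical stand-in for RMFatalEventType; only the two types discussed here. */
enum FatalType { STATE_STORE_FENCED, STATE_STORE_OP_FAILED }

/** What the simulated RM does in response to a fatal event. */
enum Outcome { TRANSITION_TO_STANDBY, TERMINATE }

final class FailoverSketch {
    private FailoverSketch() {}

    /**
     * Mirrors the branch structure of the dispatcher: only a fenced store
     * on an HA-enabled RM leads to standby; everything else terminates.
     */
    static Outcome dispatch(FatalType type, boolean haEnabled) {
        if (type == FatalType.STATE_STORE_FENCED && haEnabled) {
            return Outcome.TRANSITION_TO_STANDBY;
        }
        return Outcome.TERMINATE;
    }

    /**
     * Assumption: the failure recurs on every RM that becomes active.
     * Each RM takes one turn as active; returns how many are still running.
     */
    static int survivingRms(int rmCount, FatalType failure, boolean haEnabled) {
        int alive = rmCount;
        for (int i = 0; i < rmCount; i++) {
            if (dispatch(failure, haEnabled) == Outcome.TERMINATE) {
                alive--;
            }
        }
        return alive;
    }
}
```

Under these assumptions, FailoverSketch.survivingRms(2, FatalType.STATE_STORE_OP_FAILED, true) leaves no RM running: the failure simply follows the active role from one RM to the next.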



> Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2019
>                 URL: https://issues.apache.org/jira/browse/YARN-2019
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Junping Du
>            Priority: Critical
>              Labels: ha
>
> Currently, if anything abnormal happens in ZKRMStateStore, it throws a fatal exception
> that crashes the RM. As shown in YARN-1924, the cause could be an internal RM HA bug rather
> than a truly fatal exception. We should revisit this decision, since the HA feature is
> designed to protect key components, not disrupt them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
