hadoop-yarn-issues mailing list archives

From "Tsuyoshi OZAWA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
Date Mon, 05 May 2014 22:09:18 GMT

    [ https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990037#comment-13990037 ]

Tsuyoshi OZAWA commented on YARN-2019:

RMStateStore handles the exceptions in ZKRMStateStore like this: 
    try {
      // ZK related operations
    } catch (Exception e) {
      // ... (handler body not shown in this excerpt)
    }

If the store is fenced, RMFatalEventDispatcher handles the exception and the RM goes into
standby state. However, if STATE_STORE_OP_FAILED occurs, the active RM terminates. After
failover to the standby RM, the same exception could be repeated on the new active RM. Maybe
this is the case [~djp] mentioned. Please correct me if I'm wrong.
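
To make the concern concrete, here is a minimal, self-contained sketch of the failover scenario described above. It is not Hadoop code: `FailoverLoopSketch`, `StoreOp`, and `simulate` are hypothetical names, and the sketch assumes the store operation fails the same way no matter which RM is active.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FailoverLoopSketch {
    /** Hypothetical stand-in for a state-store operation that may throw. */
    public interface StoreOp {
        void run() throws Exception;
    }

    /**
     * Lets each RM become active in turn and run the store operation.
     * Returns the RMs that terminated because the operation failed.
     */
    public static List<String> simulate(List<String> rms, StoreOp op) {
        List<String> terminated = new ArrayList<>();
        for (String rm : rms) {
            try {
                op.run();
                break; // operation succeeded, this RM stays active
            } catch (Exception e) {
                // Assumed to model STATE_STORE_OP_FAILED: the active RM exits
                // and the next RM in the list takes over.
                terminated.add(rm);
            }
        }
        return terminated;
    }

    public static void main(String[] args) {
        StoreOp alwaysFails = () -> {
            throw new Exception("simulated ZK failure");
        };
        // Both RMs terminate, because the failure follows the operation,
        // not the RM instance.
        System.out.println(simulate(Arrays.asList("rm1", "rm2"), alwaysFails));
    }
}
```

If the failure is not tied to a particular RM instance, failover only moves the crash from one RM to the next.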

  public static class RMFatalEventDispatcher
      implements EventHandler<RMFatalEvent> {
    public void handle(RMFatalEvent event) {
      LOG.fatal("Received a " + RMFatalEvent.class.getName() + " of type " +
          event.getType().name() + ". Cause:\n" + event.getCause());

      if (event.getType() == RMFatalEventType.STATE_STORE_FENCED) {
        LOG.info("RMStateStore has been fenced");
        if (rmContext.isHAEnabled()) {
          try {
            // Transition to standby and reinit active services
            LOG.info("Transitioning RM to Standby mode");
            // ... (transition call not shown in this excerpt)
          } catch (Exception e) {
            LOG.fatal("Failed to transition RM to Standby mode.");
          }
        }
      }

      ExitUtil.terminate(1, event.getCause());
    }
  }
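
The decision the dispatcher makes can be sketched as a small pure function. This is an illustration under assumptions, not the ResourceManager implementation: `FatalEventSketch`, `Action`, and `decide` are hypothetical names, and the sketch assumes that a successful transition to standby skips the final terminate call (the excerpt above does not show that step).

```java
public class FatalEventSketch {
    /** Mirrors the two event types discussed in this comment. */
    public enum EventType { STATE_STORE_FENCED, STATE_STORE_OP_FAILED }

    /** Hypothetical outcome labels for illustration only. */
    public enum Action { TRANSITION_TO_STANDBY, TERMINATE }

    /**
     * Returns what the RM would do for a fatal event, assuming a successful
     * standby transition prevents termination.
     */
    public static Action decide(EventType type, boolean haEnabled,
                                boolean transitionSucceeds) {
        if (type == EventType.STATE_STORE_FENCED && haEnabled && transitionSucceeds) {
            return Action.TRANSITION_TO_STANDBY;
        }
        // Everything else, including STATE_STORE_OP_FAILED, ends in termination.
        return Action.TERMINATE;
    }

    public static void main(String[] args) {
        System.out.println(decide(EventType.STATE_STORE_FENCED, true, true));
        System.out.println(decide(EventType.STATE_STORE_OP_FAILED, true, true));
    }
}
```

Under these assumptions, only a fenced store with HA enabled avoids termination; a STATE_STORE_OP_FAILED event always ends the active RM.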

> Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
> ------------------------------------------------------------------------------------
>                 Key: YARN-2019
>                 URL: https://issues.apache.org/jira/browse/YARN-2019
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Junping Du
>            Priority: Critical
>              Labels: ha
> Currently, if anything abnormal happens in ZKRMStateStore, it throws a fatal exception
that crashes the RM. As shown in YARN-1924, the cause could be an internal RM HA bug rather
than a genuinely fatal exception. We should revisit this decision, since the HA feature is
designed to protect a key component, not to disrupt it.

This message was sent by Atlassian JIRA
