hadoop-yarn-issues mailing list archives

From "Jian He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2010) If RM fails to recover an app, it can never transition to active again
Date Tue, 28 Oct 2014 01:25:35 GMT

    [ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186176#comment-14186176 ]

Jian He commented on YARN-2010:
-------------------------------

bq. Any subsequent attempts to transition the RM to active fail because RMActiveServices is
not INITED, as in the Standby case
I think YARN-2588 fixed this. Are you running into this error with the patch?
- How about moving “addApplicationSync” into RMAppRecoveredTransition? We can catch the
exception inside the transition and return the failed state directly.
{code}
      // If security is enabled and the application is NOT in a final state,
      // parse the credentials and renew delegation token
      if (UserGroupInformation.isSecurityEnabled() &&
          !isApplicationInFinalState(appState.getState())) {
        Credentials credentials = parseCredentials(appContext);
        // synchronously renew delegation token on recovery.
        rmContext.getDelegationTokenRenewer().addApplicationSync(appId,
            credentials, appContext.getCancelTokensWhenComplete());
      }

      // Actual recovery of the application
      application.handle(new RMAppEvent(appId, RMAppEventType.RECOVER));
    } catch (Exception e) {
      LOG.error("Failed to recover application + " + appId, e);
      // Fail the application if it is a running application.
      if (!isApplicationInFinalState(appState.getState())) {
        rmContext.getDispatcher().getEventHandler().handle(
            new RMAppRejectedEvent(appId, e.getMessage()));
      }
      throw e;
{code}
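A minimal sketch of the suggestion above, assuming the token renewal moves into the recovered transition and any failure maps straight to a failed state instead of propagating. Every type here (RecoverySketch, TokenRenewer, RecoveredTransition, AppState) is a hypothetical stand-in so the example is self-contained; none of them is the real YARN class, and the real RMAppRecoveredTransition has a different signature.

```java
// Hypothetical, self-contained model; these are NOT the real YARN classes.
final class RecoverySketch {

    /** Simplified stand-in for the application states involved in recovery. */
    enum AppState { NEW, ACCEPTED, FAILED }

    /** Stand-in for the delegation token renewer; may throw on bad or expired tokens. */
    interface TokenRenewer {
        void addApplicationSync(String appId) throws Exception;
    }

    /** Stand-in for a recovered transition that chooses the next state itself. */
    static final class RecoveredTransition {
        private final TokenRenewer renewer;
        private final boolean securityEnabled;
        private String diagnostics = "";

        RecoveredTransition(TokenRenewer renewer, boolean securityEnabled) {
            this.renewer = renewer;
            this.securityEnabled = securityEnabled;
        }

        /**
         * Renews tokens inside the transition. A renewal failure fails only
         * this application; nothing is rethrown to the caller.
         */
        AppState transition(String appId) {
            if (securityEnabled) {
                try {
                    renewer.addApplicationSync(appId);
                } catch (Exception e) {
                    diagnostics = "Failed to recover application " + appId
                        + ": " + e.getMessage();
                    return AppState.FAILED;
                }
            }
            return AppState.ACCEPTED;
        }

        String getDiagnostics() {
            return diagnostics;
        }
    }

    public static void main(String[] args) {
        // A renewer that always fails, to exercise the failure path.
        RecoveredTransition failing = new RecoveredTransition(
            appId -> { throw new java.io.IOException("token expired"); }, true);
        System.out.println(failing.transition("app_1"));
        System.out.println(failing.getDiagnostics());
    }
}
```

With this shape, one bad application ends up in the failed state with diagnostics attached, while the caller that drives recovery never sees the exception.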
- Changes in TestWorkPreservingRMRestart:
This was done on purpose, to force the RM to fail if the app's queue is missing and so
prompt the admin to configure the queue properly.

> If RM fails to recover an app, it can never transition to active again
> ----------------------------------------------------------------------
>
>                 Key: YARN-2010
>                 URL: https://issues.apache.org/jira/browse/YARN-2010
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: bc Wong
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: YARN-2010.1.patch, YARN-2010.patch, issue-stacktrace.rtf, yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch, yarn-2010-5.patch, yarn-2010-6.patch
>
>
> Sometimes, the RM fails to recover an application. It could be because of turning security
> on, token expiry, or issues connecting to HDFS, etc. The causes could be classified as (1)
> transient, (2) specific to one application, or (3) permanent and applying to multiple (all)
> applications. Today, the RM fails to transition to Active, ends up in the STOPPED state, and
> can never be transitioned to Active again.
> The initial stacktrace reported is at https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
