hadoop-yarn-issues mailing list archives

From "Jian He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2010) Handle app-recovery failures gracefully
Date Mon, 03 Nov 2014 05:01:35 GMT

    [ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194250#comment-14194250 ]

Jian He commented on YARN-2010:

Given that the RM starts up and renews the token synchronously, I don't quite understand
why we have to catch the queue exception and stop the RM asynchronously via events. I think
it's fine to just let the exception propagate and let the RM stop.

After a closer look: on recovery, a RUNNING app moves to the ACCEPTED state, and the ACCEPTED
state does not actually handle RMAppRejectedEvent. I think the following will cause an
UnhandledEventException in the RMApp state machine:

      application.handle(new RMAppEvent(appId, RMAppEventType.RECOVER));
    } catch (Exception e) {
      LOG.warn("Failed to recover application.", e);
      if (!isApplicationInFinalState(appState.getState())) {
            .handle(new RMAppRejectedEvent(appId, e.getMessage()));

We may still need to move {{addApplicationSync}} into RMAppRecoveredTransition. Please let
me know your thoughts. Thanks.
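
To illustrate the concern, here is a minimal, self-contained sketch of a state machine that
raises an error when an event arrives in a state with no registered transition. It does not
use the real YARN classes: RecoverySketch, ToyApp, the State and EventType enums, and the
transition table are all hypothetical stand-ins, and the exception class merely borrows the
name used above. The only behavior it assumes from this thread is that recovery leaves the
app in ACCEPTED and that ACCEPTED has no handler for a rejection event.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class RecoverySketch {
  // Hypothetical, reduced set of states and events; not the real RMApp enums.
  enum State { NEW, ACCEPTED, FAILED }
  enum EventType { RECOVER, APP_REJECTED }

  /** Stand-in for the error raised on an unregistered (state, event) pair. */
  static class UnhandledEventException extends RuntimeException {
    UnhandledEventException(State state, EventType event) {
      super("Unhandled event " + event + " in state " + state);
    }
  }

  static class ToyApp {
    // Assumed transition table: only NEW handles events; ACCEPTED handles none.
    private static final Map<State, Set<EventType>> HANDLED = new EnumMap<>(State.class);
    static {
      HANDLED.put(State.NEW, EnumSet.of(EventType.RECOVER, EventType.APP_REJECTED));
      HANDLED.put(State.ACCEPTED, EnumSet.noneOf(EventType.class));
      HANDLED.put(State.FAILED, EnumSet.noneOf(EventType.class));
    }

    private State state = State.NEW;

    State getState() {
      return state;
    }

    void handle(EventType event) {
      if (!HANDLED.get(state).contains(event)) {
        throw new UnhandledEventException(state, event);
      }
      // RECOVER moves the app to ACCEPTED; a handled rejection moves it to FAILED.
      state = (event == EventType.RECOVER) ? State.ACCEPTED : State.FAILED;
    }
  }

  /** Recovers a toy app, then sends a rejection; reports whether the rejection was unhandled. */
  static boolean rejectionThrowsAfterRecovery() {
    ToyApp app = new ToyApp();
    app.handle(EventType.RECOVER);
    try {
      app.handle(EventType.APP_REJECTED);
      return false;
    } catch (UnhandledEventException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("rejection throws after recovery: " + rejectionThrowsAfterRecovery());
  }
}
```

In this toy model the rejection only succeeds if it is sent before the RECOVER event is
handled, which is the ordering problem described above.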

> Handle app-recovery failures gracefully
> ---------------------------------------
>                 Key: YARN-2010
>                 URL: https://issues.apache.org/jira/browse/YARN-2010
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: bc Wong
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: YARN-2010.1.patch, YARN-2010.patch, issue-stacktrace.rtf, yarn-2010-2.patch,
>                      yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch, yarn-2010-5.patch,
>                      yarn-2010-6.patch, yarn-2010-7.patch, yarn-2010-8.patch, yarn-2010-9.patch
>
> Sometimes, the RM fails to recover an application. It could be because of turning security
> on, token expiry, or issues connecting to HDFS etc. The causes could be classified into (1)
> transient, (2) specific to one application, and (3) permanent and apply to multiple (all)
> applications. Today, the RM fails to transition to Active and ends up in STOPPED state and
> can never be transitioned to Active again.
> The initial stacktrace reported is at https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf

This message was sent by Atlassian JIRA