hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Karthik Kambatla (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2010) Handle app-recovery failures gracefully
Date Tue, 04 Nov 2014 07:52:35 GMT

    [ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195847#comment-14195847
] 

Karthik Kambatla commented on YARN-2010:
----------------------------------------

Jian - thanks for taking the time to look at this closely. The patch looks mostly good. However,
this patch only catches DT renewal issues during app recovery. Any other errors that could
be encountered in the remaining code paths are not handled gracefully. For instance, any errors
during app.recoverAppAttempts can affect the health of the RM.



> Handle app-recovery failures gracefully
> ---------------------------------------
>
>                 Key: YARN-2010
>                 URL: https://issues.apache.org/jira/browse/YARN-2010
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: bc Wong
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: YARN-2010.1.patch, YARN-2010.patch, issue-stacktrace.rtf, yarn-2010-10.patch,
yarn-2010-11.patch, yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch,
yarn-2010-5.patch, yarn-2010-6.patch, yarn-2010-7.patch, yarn-2010-8.patch, yarn-2010-9.patch
>
>
> Sometimes, the RM fails to recover an application. It could be because of turning security
on, token expiry, or issues connecting to HDFS etc. The causes could be classified into (1)
transient, (2) specific to one application, and (3) permanent and apply to multiple (all)
applications. Today, the RM fails to transition to Active and ends up in STOPPED state and
can never be transitioned to Active again.
> The initial stacktrace reported is at https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message