hadoop-yarn-issues mailing list archives

From "Karthik Kambatla (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2010) Handle app-recovery failures gracefully
Date Mon, 03 Nov 2014 18:10:35 GMT

    [ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194810#comment-14194810 ]

Karthik Kambatla commented on YARN-2010:
----------------------------------------

bq. Given that the RM starts and renews the token synchronously, I don't quite
understand why we have to catch the queue exception and stop the RM asynchronously via events.
I think it's fine to just let the exception propagate and let the RM stop.
This does not happen only on startup; transitions to Active also go through this path. In HA
cases, we would want to transition to standby, no?
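
To make the distinction concrete, here is a minimal sketch of the decision being discussed. The class, enum, and `haEnabled` flag are hypothetical names chosen for illustration and are not the actual ResourceManager code; it only assumes that a non-HA RM has no standby state to fall back to.

```java
// Hypothetical sketch; these names are not the actual ResourceManager API.
final class RecoveryFailurePolicy {
    enum Action { TRANSITION_TO_STANDBY, STOP }

    // haEnabled is an assumed flag standing in for the RM's HA configuration.
    static Action onRecoveryFailure(boolean haEnabled) {
        // With HA, the failure may come from a transition to Active, so
        // falling back to standby leaves the RM eligible for a later retry.
        // Without HA, there is no standby state to fall back to.
        return haEnabled ? Action.TRANSITION_TO_STANDBY : Action.STOP;
    }
}
```

Under this sketch, letting the exception simply propagate and stop the RM corresponds to the non-HA branch only.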

bq. After a closer look: a RUNNING app moves to the ACCEPTED state on recovery, and the ACCEPTED
state does not actually handle RMAppRejectedEvent.
Good point. What do you think of handling rejection in ACCEPTED as well? 
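
A rough sketch of what handling rejection in ACCEPTED could look like, using a simplified transition table. The states, events, and class name below are illustrative assumptions and do not mirror the real RMAppImpl state machine or its event types.

```java
import java.util.EnumMap;
import java.util.Map;

// Simplified, hypothetical model; not the real RMAppImpl state machine.
final class AppStateSketch {
    enum State { NEW, ACCEPTED, FAILED }
    enum Event { RECOVER, REJECTED }

    private static final Map<State, Map<Event, State>> TRANSITIONS =
        new EnumMap<>(State.class);

    static {
        register(State.NEW, Event.RECOVER, State.ACCEPTED);
        register(State.NEW, Event.REJECTED, State.FAILED);
        // The proposed addition: ACCEPTED also handles rejection, so a
        // recovered app that fails afterwards is not left with an unhandled event.
        register(State.ACCEPTED, Event.REJECTED, State.FAILED);
    }

    private static void register(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, k -> new EnumMap<>(Event.class)).put(on, to);
    }

    // Returns the next state, or throws if the event is not handled in this state.
    static State next(State current, Event event) {
        Map<Event, State> row = TRANSITIONS.get(current);
        if (row == null || !row.containsKey(event)) {
            throw new IllegalStateException("Invalid event " + event + " at " + current);
        }
        return row.get(event);
    }
}
```

Without the ACCEPTED registration, `next(State.ACCEPTED, Event.REJECTED)` would throw in this model, which is the gap pointed out in the quoted comment.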

bq. We may still need to move addApplicationSync into RMAppRecoveredTransition.
I am not sure if this is necessarily related to the rest of the patch. It is definitely a
code improvement. 

> Handle app-recovery failures gracefully
> ---------------------------------------
>
>                 Key: YARN-2010
>                 URL: https://issues.apache.org/jira/browse/YARN-2010
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: bc Wong
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: YARN-2010.1.patch, YARN-2010.patch, issue-stacktrace.rtf, yarn-2010-2.patch,
>                      yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch, yarn-2010-5.patch, yarn-2010-6.patch,
>                      yarn-2010-7.patch, yarn-2010-8.patch, yarn-2010-9.patch
>
>
> Sometimes, the RM fails to recover an application. It could be because of turning security
> on, token expiry, or issues connecting to HDFS, etc. The causes could be classified as (1)
> transient, (2) specific to one application, and (3) permanent and applying to multiple (all)
> applications. Today, the RM fails to transition to Active, ends up in the STOPPED state, and
> can never be transitioned to Active again.
> The initial stacktrace reported is at https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
