hadoop-yarn-issues mailing list archives

From "Karthik Kambatla (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2010) Handle app-recovery failures gracefully
Date Sat, 01 Nov 2014 17:56:35 GMT

    [ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193311#comment-14193311 ]

Karthik Kambatla commented on YARN-2010:
----------------------------------------

The latest patch is a step back, closer to the v6 patch. Fixing the test failures on v7 of
the patch was more involved than I expected and was taking too long. So, in the interest of
time, I would like to move credential parsing to RMAppRecoveredTransition in a follow-up
JIRA.

bq. Inside the catch, we may just return FAILED?
This doesn't apply anymore. Will take a closer look in the follow-up JIRA.

bq. I don't think we can get ConnectException here. Could you explain under what scenario
we get ConnectException?
The comments elaborate on potential reasons for ConnectException. The stack trace for one
instance is here: https://issues.apache.org/jira/browse/YARN-2010?focusedCommentId=14164516&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14164516



> Handle app-recovery failures gracefully
> ---------------------------------------
>
>                 Key: YARN-2010
>                 URL: https://issues.apache.org/jira/browse/YARN-2010
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: bc Wong
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: YARN-2010.1.patch, YARN-2010.patch, issue-stacktrace.rtf, yarn-2010-2.patch,
>                      yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch, yarn-2010-5.patch,
>                      yarn-2010-6.patch, yarn-2010-7.patch, yarn-2010-8.patch
>
>
> Sometimes, the RM fails to recover an application. Possible causes include security being
> turned on, token expiry, or problems connecting to HDFS. The causes can be classified as
> (1) transient, (2) specific to one application, or (3) permanent and applying to multiple
> (or all) applications. Today, the RM fails to transition to Active, ends up in the STOPPED
> state, and can never be transitioned to Active again.
> The initial stacktrace reported is at https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf
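
The three failure classes in the description above can be illustrated with a small, self-contained Java sketch. It is not taken from any of the attached patches and uses no YARN classes: `RecoveryFailureSketch`, `FailureKind`, `AppRecoveryException`, `RecoveryStep`, `classify` and `recoverAll` are all hypothetical names, and the mapping from exception type to failure class (including treating `java.net.ConnectException` as transient) is an assumption made purely for illustration.

```java
import java.net.ConnectException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the YARN code base.
class RecoveryFailureSketch {

    /** The three failure classes named in the issue description. */
    enum FailureKind { TRANSIENT, APP_SPECIFIC, PERMANENT }

    /** Hypothetical marker for a failure confined to a single application. */
    static class AppRecoveryException extends Exception {
        AppRecoveryException(String message) { super(message); }
    }

    /** Hypothetical recovery action for one application. */
    interface RecoveryStep { void run() throws Exception; }

    /** Assumed mapping from exception type to failure class (illustration only). */
    static FailureKind classify(Throwable t) {
        if (t instanceof ConnectException) {
            return FailureKind.TRANSIENT;     // assumed: a retry may succeed later
        }
        if (t instanceof AppRecoveryException) {
            return FailureKind.APP_SPECIFIC;  // assumed: only this application is affected
        }
        return FailureKind.PERMANENT;         // assumed: everything else affects all apps
    }

    /**
     * Runs every recovery step. An app-specific failure is recorded and skipped
     * so that the remaining applications still recover; any other failure is
     * rethrown (unchecked) so the caller can decide whether to retry or stop.
     * Returns the ids of the applications that were skipped.
     */
    static List<String> recoverAll(Map<String, RecoveryStep> apps) {
        List<String> skipped = new ArrayList<>();
        for (Map.Entry<String, RecoveryStep> entry : apps.entrySet()) {
            try {
                entry.getValue().run();
            } catch (Exception e) {
                if (classify(e) == FailureKind.APP_SPECIFIC) {
                    skipped.add(entry.getKey());
                } else {
                    throw new IllegalStateException(
                        "Recovery aborted at " + entry.getKey(), e);
                }
            }
        }
        return skipped;
    }
}
```

In this sketch a single bad application does not prevent the others from recovering, while transient and permanent failures are surfaced to the caller rather than swallowed. Whether the actual patch draws the lines this way is not stated in this message.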



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
