hadoop-yarn-issues mailing list archives

From "Karthik Kambatla (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart
Date Wed, 14 Aug 2013 23:50:51 GMT

    [ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740435#comment-13740435 ]

Karthik Kambatla commented on YARN-1055:
----------------------------------------

In Hadoop 1, we set job.recovery.enable to true for the launcher job and to false
for the action job. When the JT restarts, only the launcher is recovered. The recovered launcher
then starts the action exactly as it did before.

In Hadoop 2, that translates to setting max-am-retries to > 1 for the launcher job
and to 1 for the action job. When the RM restarts, only the launcher is recovered, and it restarts
the action. However, if the action AM alone dies (for example, because the node running it crashes)
and the launcher AM does not, the launcher does not retry the action. In other words, the failure
is ignored.
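
To make the gap concrete, here is a toy model of the Hadoop 2 behavior described above. It is only a sketch and not YARN code: the App class, its max_attempts field (standing in for the max-am-retries setting), and both helper functions are hypothetical names, and the rule "an app gets another attempt only if it is allowed more than one" is a simplification of what the comment describes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class App:
    """Hypothetical stand-in for a YARN application (not a real YARN class)."""
    name: str
    max_attempts: int  # assumption: models the max-am-retries setting


def recovered_after_rm_restart(apps: List[App]) -> List[str]:
    """Apps that get a new attempt after an RM restart in this toy model."""
    return [a.name for a in apps if a.max_attempts > 1]


def retried_after_am_failure(failed: str, apps: List[App]) -> List[str]:
    """Apps that get a new attempt when only the AM of `failed` dies."""
    return [a.name for a in apps if a.name == failed and a.max_attempts > 1]


# Settings from the comment: launcher > 1, action = 1.
apps = [App("launcher", max_attempts=2), App("action", max_attempts=1)]

# RM restart: only the launcher comes back, and it restarts the action itself.
print(recovered_after_rm_restart(apps))

# Action AM dies alone: nothing is retried, so the failure goes unhandled.
print(retried_after_am_failure("action", apps))
```

In this model a single knob has to cover both cases, so keeping the action at 1 to avoid a duplicate recovery on RM restart also removes any retry when the action AM fails on its own.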

                
> Handle app recovery differently for AM failures and RM restart
> --------------------------------------------------------------
>
>                 Key: YARN-1055
>                 URL: https://issues.apache.org/jira/browse/YARN-1055
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery for AM
> and RM currently relies on the max-attempts config; tolerating AM failures requires
> it to be > 1 and tolerating RM failure/restart requires it to be = 1.
> We should handle these two differently, with two separate configs.
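
One way the proposal of two separate configs might look is sketched below. This is purely illustrative: RecoveryPolicy, its two fields, and should_relaunch are hypothetical names invented for this example, not existing or proposed YARN settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPolicy:
    """Hypothetical per-app policy with one knob per failure type."""
    max_am_failure_attempts: int  # assumption: attempts allowed across AM failures
    recover_on_rm_restart: bool   # assumption: whether the RM recovers the app


def should_relaunch(policy: RecoveryPolicy, event: str, attempts_so_far: int) -> bool:
    """Decide whether to start a new attempt for the given event."""
    if event == "rm_restart":
        return policy.recover_on_rm_restart
    if event == "am_failure":
        return attempts_so_far < policy.max_am_failure_attempts
    raise ValueError(f"unknown event: {event}")


# Launcher: recovered on RM restart. Action: retried on AM failure only,
# because the recovered launcher restarts it after an RM restart.
launcher = RecoveryPolicy(max_am_failure_attempts=2, recover_on_rm_restart=True)
action = RecoveryPolicy(max_am_failure_attempts=2, recover_on_rm_restart=False)

print(should_relaunch(action, "am_failure", attempts_so_far=1))
print(should_relaunch(action, "rm_restart", attempts_so_far=1))
```

With the two decisions separated, the action job can tolerate its own AM dying without also being recovered a second time after an RM restart.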

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
