hadoop-yarn-issues mailing list archives

From "Alejandro Abdelnur (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart
Date Tue, 13 Aug 2013 23:42:48 GMT

    [ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739027#comment-13739027 ]

Alejandro Abdelnur commented on YARN-1055:

[~vinodkv], in theory I agree with you. In practice, there are 2 issues that we (Oozie) cannot
address in the short term:

* 1. Oozie is still using a launcher MRAM.
* 2. mr/pig/hive/sqoop/distcp/... fat clients which are not aware of Yarn restart/recovery.

#1 will be addressed when Oozie implements an OozieLauncherAM instead of piggybacking on an MR
Map as the driver.
#2 is more complicated, and I don't see it being addressed in the short/medium term.

By having distinct knobs that differentiate recovery after AM failure from recovery after RM
restart, Oozie can handle/recover jobs in the same set of failure scenarios that are possible
with Hadoop 1. To get folks onto Yarn, we need to provide functional parity.

I suggest having the 2 knobs Karthik proposed, {{restart.am.on.rm.restart}} and {{restart.am.on.am.failure}},
with {{restart.am.on.rm.restart=$restart.am.on.am.failure}}.
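
A minimal sketch of how the two knobs could be resolved. This is not YARN code: the class and function names are hypothetical, and it assumes that {{restart.am.on.rm.restart=$restart.am.on.am.failure}} means the RM-restart knob falls back to the AM-failure knob when it is not set explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryKnobs:
    """Hypothetical holder for the two proposed settings."""

    # Mirrors restart.am.on.am.failure.
    restart_am_on_am_failure: bool
    # Mirrors restart.am.on.rm.restart; None means "not set explicitly",
    # in which case we assume it inherits the AM-failure value.
    restart_am_on_rm_restart: Optional[bool] = None

    def effective_on_rm_restart(self) -> bool:
        if self.restart_am_on_rm_restart is None:
            return self.restart_am_on_am_failure
        return self.restart_am_on_rm_restart


def should_restart_am(knobs: RecoveryKnobs, event: str) -> bool:
    """Decide whether to relaunch the AM after 'am_failure' or 'rm_restart'."""
    if event == "am_failure":
        return knobs.restart_am_on_am_failure
    if event == "rm_restart":
        return knobs.effective_on_rm_restart()
    raise ValueError(f"unknown event: {event!r}")
```

Under these assumptions, a client that cannot survive an RM restart could set only the RM-restart knob to false and still have its AM relaunched after an ordinary AM failure.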

Does this sound reasonable?
> Handle app recovery differently for AM failures and RM restart
> --------------------------------------------------------------
>                 Key: YARN-1055
>                 URL: https://issues.apache.org/jira/browse/YARN-1055
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
> Ideally, we would like to tolerate container, AM, RM failures. App recovery for AM and
> RM currently relies on the max-attempts config; tolerating AM failures requires it to be > 1
> and tolerating RM failure/restart requires it to be = 1.
> We should handle these two differently, with two separate configs.
