hadoop-yarn-issues mailing list archives

From "Alejandro Abdelnur (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart
Date Tue, 13 Aug 2013 23:42:48 GMT

    [ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739027#comment-13739027 ]

Alejandro Abdelnur commented on YARN-1055:
------------------------------------------

[~vinodkv], in theory I agree with you. In practice, there are 2 issues that we (Oozie) cannot
address in the short term:

* 1. Oozie is still using a launcher MRAM
* 2. mr/pig/hive/sqoop/distcp/... fat clients which are not aware of Yarn restart/recovery.

#1 will be addressed when Oozie implements an OozieLauncherAM instead of piggybacking on an MR
Map as the driver.
#2 is more complicated, and I don't see it being addressed in the short/medium term.

By having distinct knobs that differentiate recovery after AM failure from recovery after RM
restart, Oozie can handle/recover jobs in the same set of failure scenarios that is possible with
Hadoop 1. To get folks onto Yarn, we need to provide functional parity.

I suggest having the 2 knobs Karthik proposed, {{restart.am.on.rm.restart}} and {{restart.am.on.am.failure}},
with {{restart.am.on.rm.restart=$restart.am.on.am.failure}}.
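
To make the proposal concrete, here is a minimal sketch of how the two knobs might be resolved. It is only an illustration under stated assumptions: the knob names are the proposed ones (reading the second as {{restart.am.on.am.failure}}) and are not existing YARN properties, the config is modeled as a plain dict, and the {{true}} default for the AM-failure knob and the helper names are hypothetical.

```python
# Hypothetical knob names taken from the proposal; not existing YARN properties.
AM_FAILURE_KNOB = "restart.am.on.am.failure"
RM_RESTART_KNOB = "restart.am.on.rm.restart"


def _as_bool(value):
    """Interpret a config value such as "true"/"false" as a boolean."""
    return str(value).strip().lower() == "true"


def resolve_knobs(conf, am_failure_default=True):
    """Return (restart_on_am_failure, restart_on_rm_restart) for a config dict.

    The RM-restart knob falls back to the AM-failure knob when it is unset,
    mirroring restart.am.on.rm.restart=$restart.am.on.am.failure.
    The default of True for the AM-failure knob is an assumption.
    """
    on_am_failure = _as_bool(conf.get(AM_FAILURE_KNOB, am_failure_default))
    if RM_RESTART_KNOB in conf:
        on_rm_restart = _as_bool(conf[RM_RESTART_KNOB])
    else:
        on_rm_restart = on_am_failure
    return on_am_failure, on_rm_restart


def should_restart_am(conf, cause):
    """Decide whether to start a new AM attempt.

    cause is "am_failure" or "rm_restart" (hypothetical labels).
    """
    on_am_failure, on_rm_restart = resolve_knobs(conf)
    if cause == "am_failure":
        return on_am_failure
    if cause == "rm_restart":
        return on_rm_restart
    raise ValueError(f"unknown cause: {cause!r}")
```

With this shape, a deployment that sets only the AM-failure knob gets the same behavior for both failure types, while setting both knobs lets the two cases be tuned independently.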

Does this sound reasonable?
                
> Handle app recovery differently for AM failures and RM restart
> --------------------------------------------------------------
>
>                 Key: YARN-1055
>                 URL: https://issues.apache.org/jira/browse/YARN-1055
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery for AM and RM
> currently relies on the max-attempts config; tolerating AM failures requires it to be > 1 and
> tolerating RM failure/restart requires it to be = 1.
> We should handle these two differently, with two separate configs.
