hadoop-yarn-issues mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart
Date Tue, 13 Aug 2013 23:55:48 GMT

    [ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739040#comment-13739040 ]

Vinod Kumar Vavilapalli commented on YARN-1055:
-----------------------------------------------

This is an entirely new issue with Hadoop 2 - we've added new failure conditions. Having all
apps handle AM restarts is really the right way forward, given that AMs can now run on arbitrary
compute nodes that can fail at any time. Offline, I have started engaging some of the Pig/Hive
community folks. For MR, enough work is already done. Oozie needs to follow suit too.

Until work-preserving restart is finished, this is a real pain on RM restarts. That is why
I am proposing that Oozie set max-attempts to 1 for its launcher action, so that there are
no split-brain issues - RM restart or otherwise. Oozie has its own retry mechanism anyway, and
a retry will then be submitted as a new application.
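
To make the effect of that setting concrete, here is a minimal sketch in plain Java. It is not YARN or Oozie code: the class, enum, and method names are hypothetical, and it assumes the attempt limit is applied the same way whatever caused the AM to be lost.

```java
// Hypothetical model, not YARN or Oozie code: one attempt limit shared by
// every cause of AM loss, standing in for the max-attempts setting.
class AttemptLimitSketch {

    // Reasons an application master attempt can be lost.
    enum Cause { AM_FAILURE, NODE_FAILURE, RM_RESTART }

    // A new attempt is allowed only while the number of attempts already
    // started is below the limit. The cause is deliberately ignored
    // (assumption: a single knob cannot tell the causes apart).
    static boolean mayStartNewAttempt(int attemptsStarted, int maxAttempts, Cause cause) {
        return attemptsStarted < maxAttempts;
    }

    public static void main(String[] args) {
        // One attempt has already run and been lost; would a second one start?
        for (int maxAttempts : new int[] {1, 2}) {
            for (Cause cause : Cause.values()) {
                boolean retry = mayStartNewAttempt(1, maxAttempts, cause);
                System.out.println("max-attempts=" + maxAttempts
                        + " cause=" + cause + " secondAttempt=" + retry);
            }
        }
    }
}
```

With the limit at 1, the model never starts a second attempt inside the same application, so any retry has to come from the client side as a new application.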

Adding a separate knob just for restart is a hack that I don't see any value in. If I read your
proposal correctly, for launcher jobs you will set restart.am.on.rm.restart to 1 and
restart.am.on.failure to a value greater than 1. Right? That is not correct, as I said before -
node failures will cause the same split-brain issues.
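
The objection can be illustrated with a small sketch in plain Java. All names here are hypothetical, and the two integer limits only stand in for the two proposed knobs. It assumes a node failure is counted against the failure knob rather than the RM-restart knob, which is the reading the argument above relies on.

```java
// Hypothetical model of the two-knob proposal as read in the comment above.
// Not YARN or Oozie code.
class TwoKnobSketch {

    // Reasons an application master attempt can be lost.
    enum Cause { AM_FAILURE, NODE_FAILURE, RM_RESTART }

    // Pick the limit by cause: RM restarts use one knob, every other
    // cause (including node failure, by assumption) uses the other.
    static int limitFor(Cause cause, int limitOnRmRestart, int limitOnFailure) {
        return cause == Cause.RM_RESTART ? limitOnRmRestart : limitOnFailure;
    }

    // A new attempt is allowed while attempts started so far are below
    // the limit that applies to this cause.
    static boolean mayStartNewAttempt(int attemptsStarted, Cause cause,
                                      int limitOnRmRestart, int limitOnFailure) {
        return attemptsStarted < limitFor(cause, limitOnRmRestart, limitOnFailure);
    }

    public static void main(String[] args) {
        int limitOnRmRestart = 1; // stands in for the RM-restart knob set to 1
        int limitOnFailure = 2;   // stands in for the failure knob set above 1
        for (Cause cause : Cause.values()) {
            boolean retry = mayStartNewAttempt(1, cause, limitOnRmRestart, limitOnFailure);
            System.out.println("cause=" + cause + " secondAttempt=" + retry);
        }
    }
}
```

Under these assumptions the RM-restart knob blocks a second attempt only for RM restarts. A node failure still starts one, so the split-brain case remains unless the failure limit is also 1.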
                
> Handle app recovery differently for AM failures and RM restart
> --------------------------------------------------------------
>
>                 Key: YARN-1055
>                 URL: https://issues.apache.org/jira/browse/YARN-1055
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery for AM and
> RM currently relies on the max-attempts config; tolerating AM failures requires it to be > 1
> and tolerating RM failure/restart requires it to be = 1.
> We should handle these two differently, with two separate configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
