hadoop-yarn-issues mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart
Date Wed, 14 Aug 2013 21:49:47 GMT

    [ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740242#comment-13740242 ]

Bikas Saha commented on YARN-1055:
----------------------------------

First of all, whatever needs to be set must be set through the AppSubmissionContext API for
that job. Only that is job-specific; this config cannot be global across all jobs.

With MAPREDUCE-4824, on job submission we set a property in the job conf (which is job-specific)
saying not to retry the job.
In YARN, on job submission, we say through the AppSubmissionContext API (which is also
job-specific) that max-am-retries = 1.

For a job that cannot be restarted (whether because of an AM crash, a node crash, or an RM
restart, all of which are indistinguishable with respect to the job), the per-job
max-am-retries needs to be set to 1. It's probably two weeks' worth of work to remove RM
restart from the above list. Even after that, such a job still needs to set
max-am-retries = 1 so that the RM does not restart the job when the node crashes or the AM
crashes. Why does an RM-restart-related special API need to be added now?
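
To illustrate the argument above, here is a minimal sketch of a per-job retry limit
that treats all three failure causes the same way. This is not YARN code: AppState,
FailureCause, and may_relaunch are hypothetical names invented for this example, and
the counting rule is an assumption about how such a limit would behave.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCause(Enum):
    """Hypothetical labels for the three causes named in the comment."""
    AM_CRASH = "am_crash"
    NODE_CRASH = "node_crash"
    RM_RESTART = "rm_restart"


@dataclass
class AppState:
    """Hypothetical per-job state; max_am_retries comes from the submission context."""
    max_am_retries: int
    attempts_started: int = 1  # the first attempt has already been launched


def may_relaunch(app: AppState, cause: FailureCause) -> bool:
    """Return True if another attempt may be started after a failure.

    The cause is deliberately ignored: from the job's point of view the
    three causes are indistinguishable, so only the per-job limit matters.
    """
    return app.attempts_started < app.max_am_retries


if __name__ == "__main__":
    non_restartable = AppState(max_am_retries=1)
    for cause in FailureCause:
        print(cause.name, may_relaunch(non_restartable, cause))
```

With max_am_retries = 1 the answer is the same for every cause, which is the point
made above: the limit is a per-job setting and does not depend on what failed.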

                
> Handle app recovery differently for AM failures and RM restart
> --------------------------------------------------------------
>
>                 Key: YARN-1055
>                 URL: https://issues.apache.org/jira/browse/YARN-1055
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery for AM and
> RM currently relies on the max-attempts config; tolerating AM failures requires it to be
> > 1 and tolerating RM failure/restart requires it to be = 1.
> We should handle these two differently, with two separate configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
