hadoop-yarn-issues mailing list archives

From "Karthik Kambatla (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1055) App recovery should be configurable per application
Date Mon, 12 Aug 2013 23:05:48 GMT

    [ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737487#comment-13737487 ]

Karthik Kambatla commented on YARN-1055:

Let me explain what I am getting at with the help of a concrete example.

# User is trying to run an Oozie workflow that has 10 actions; the 10th one is an MR job with 100 map tasks.
# The launcher job (AM-l) starts and subsequently starts the MR job (AM-mr-1); max-app-attempts for the launcher (AM-l) is set to > 1, say 3.
# After 95 tasks complete, AM-mr-1 goes down (node or other failure). Ideally, I would not want to restart the entire Oozie workflow for a single AM (maybe node) failure. To address this, I would want to set max-app-attempts for the MR AM to be > 1, say 3.
# Assuming max-app-attempts = 3, the MR job runs a few more tasks.
# When the MR job still has 1 task to go, the RM goes down.
# Post RM-restart, the launcher (AM-l) and the MR job (AM-mr-2) are restarted. The launcher re-runs the MR job (AM-mr-3). It is possible that AM-mr-2 and AM-mr-3 run at the same time, leading to any number of issues: performance, correctness, etc. To avoid this, I would want to set max-app-attempts = 1 for the MR action.
# Points 3 (tolerating AM failure) and 6 (tolerating RM failure) require us to set max-app-attempts to > 1 and = 1 respectively at the same time.
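
To make the conflict in the last point concrete, here is a minimal sketch. It is a toy model of the scenario above, not YARN code: the function names are hypothetical, and the second predicate simply encodes the premise of point 6 (that the only way to stop the RM from bringing the MR job back after a restart is to set max-app-attempts to 1).

```python
def tolerates_am_failure(max_app_attempts: int) -> bool:
    # Point 3: the MR job should survive a single AM (or node) failure,
    # which needs more than one attempt.
    return max_app_attempts > 1


def avoids_duplicate_run_after_rm_restart(max_app_attempts: int) -> bool:
    # Point 6 (premise of the scenario, not a YARN API): with a single knob,
    # the MR job is not brought back after RM restart only when
    # max-app-attempts is exactly 1.
    return max_app_attempts == 1


def feasible_settings(candidates):
    # Keep only the values that satisfy both requirements at once.
    return [
        n
        for n in candidates
        if tolerates_am_failure(n) and avoids_duplicate_run_after_rm_restart(n)
    ]


feasible = feasible_settings(range(1, 11))
print(feasible)  # no single value satisfies both requirements
```

Under this model the list of feasible values is empty, which is the contradiction the example is meant to show.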

Now, suppose a separate config for recovering apps on RM restart exists. I could use this config to address point 6 (the RM failure) and the current max-app-attempts for point 3 (the AM failure).
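
A rough sketch of how the two settings would decouple, again as a toy model rather than YARN code. The field name `recover_on_rm_restart` is a hypothetical placeholder for the proposed config, and the count of MR AMs follows the Oozie scenario described in this comment (the recovered launcher always re-runs the MR action).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MrActionPolicy:
    max_app_attempts: int        # existing per-app setting (AM failure)
    recover_on_rm_restart: bool  # hypothetical name for the proposed config


def tolerates_am_failure(policy: MrActionPolicy) -> bool:
    # More than one attempt lets the MR job survive a single AM failure.
    return policy.max_app_attempts > 1


def mr_ams_after_rm_restart(policy: MrActionPolicy) -> int:
    # The recovered launcher re-runs the MR action (AM-mr-3 in the example).
    relaunched_by_launcher = 1
    # The RM brings back the old MR job (AM-mr-2) only if recovery is enabled.
    recovered_by_rm = 1 if policy.recover_on_rm_restart else 0
    return relaunched_by_launcher + recovered_by_rm


proposed = MrActionPolicy(max_app_attempts=3, recover_on_rm_restart=False)
single_knob = MrActionPolicy(max_app_attempts=3, recover_on_rm_restart=True)

print(tolerates_am_failure(proposed), mr_ams_after_rm_restart(proposed))
print(tolerates_am_failure(single_knob), mr_ams_after_rm_restart(single_knob))
```

With recovery disabled for the MR action, only one MR AM runs after RM restart while AM failures are still tolerated; with recovery tied to max-app-attempts > 1, two AMs for the same action can run concurrently.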

Am I overlooking/missing something here? Thoughts?

> App recovery should be configurable per application
> ---------------------------------------------------
>                 Key: YARN-1055
>                 URL: https://issues.apache.org/jira/browse/YARN-1055
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
> In Hadoop-1, the job recovery on JT restart is configurable per-job. For parity and its usefulness, we should have the same behavior in YARN as well.

