hadoop-yarn-issues mailing list archives

From "Karthik Kambatla (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1055) App recovery should be configurable per application
Date Mon, 12 Aug 2013 23:05:48 GMT

    [ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737487#comment-13737487 ]

Karthik Kambatla commented on YARN-1055:
----------------------------------------

Let me explain what I am getting at with the help of a concrete example.

# User is trying to run an Oozie workflow that has 10 actions - the 10th one is an MR job
with 100 map tasks.
# The launcher job starts (AM-l) and subsequently starts the MR job - (AM-mr-1); the max-app-attempts
for launcher (AM-l) is set to > 1, say 3.
# After completion of 95 tasks, AM-mr-1 goes down (node or other failure). Ideally, I would
not want to restart the entire Oozie workflow for a single AM (maybe a node) failure. To address
this, I would want to set max-app-attempts for MR-AM to be > 1, say 3.
# Assuming max-app-attempts = 3, the MR job runs a few more tasks.
# When the MR job still has 1 task to go, the RM goes down.
# Post RM-restart, the launcher (AM-l) and MR job (AM-mr-2) are restarted. The launcher re-runs
the MR job - (AM-mr-3). It is possible that AM-mr-2 and AM-mr-3 run at the same time, leading
to any number of issues - performance, correctness, etc. To avoid this, I would want to set
max-app-attempts = 1 for the MR action.
# Points 3 (tolerating AM failure) and 6 (tolerating RM failure) require us to set max-app-attempts
to > 1 and = 1, respectively, at the same time.

Now, suppose a separate config for recovering apps on RM restart exists. I could use this
config to address point 6 (the RM failure) and the current max-app-attempts for point 3 (the
AM failure).
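
To make the conflict concrete, here is a minimal sketch in Python. It is a toy model of the scenario in this comment, not YARN code: the class AppPolicy, the field recover_on_rm_restart, and both helper functions are hypothetical names invented for illustration. It assumes, as the example above does, that without a separate switch max-app-attempts alone decides whether an app is restarted after an AM failure and after an RM restart.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AppPolicy:
    """Hypothetical per-application settings (illustrative names only)."""
    max_app_attempts: int
    # None models the single-knob case: no separate recovery switch exists,
    # so max_app_attempts also decides what happens after an RM restart.
    recover_on_rm_restart: Optional[bool] = None


def retries_after_am_failure(policy: AppPolicy, attempts_so_far: int) -> bool:
    """True if another attempt is launched after the AM fails."""
    return attempts_so_far < policy.max_app_attempts


def recovers_after_rm_restart(policy: AppPolicy, attempts_so_far: int) -> bool:
    """True if the app is restarted once the RM comes back."""
    has_attempts_left = attempts_so_far < policy.max_app_attempts
    if policy.recover_on_rm_restart is None:
        return has_attempts_left
    return policy.recover_on_rm_restart and has_attempts_left


def evaluate(policy: AppPolicy) -> Tuple[bool, bool]:
    """Return (tolerates AM failure, avoids a duplicate MR job after RM restart).

    Point 3: the first MR attempt has failed, so one attempt has been used.
    Point 6: the RM goes down during the second attempt; a restart by the RM
    would run alongside the copy that the launcher re-submits.
    """
    tolerates_am_failure = retries_after_am_failure(policy, attempts_so_far=1)
    avoids_duplicate = not recovers_after_rm_restart(policy, attempts_so_far=2)
    return tolerates_am_failure, avoids_duplicate


single_knob_3 = evaluate(AppPolicy(max_app_attempts=3))
single_knob_1 = evaluate(AppPolicy(max_app_attempts=1))
separate_knobs = evaluate(AppPolicy(max_app_attempts=3, recover_on_rm_restart=False))
```

In this model neither single-knob setting satisfies both goals, while the policy with a separate recovery switch does.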

Am I overlooking something here? Thoughts?
                
> App recovery should be configurable per application
> ---------------------------------------------------
>
>                 Key: YARN-1055
>                 URL: https://issues.apache.org/jira/browse/YARN-1055
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
>
> In Hadoop-1, the job recovery on JT restart is configurable per-job. For parity and its
> usefulness, we should have the same behavior in YARN as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
