hadoop-yarn-issues mailing list archives

From "Varun Vasudev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
Date Mon, 07 Mar 2016 04:36:41 GMT

    [ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182570#comment-15182570
] 

Varun Vasudev commented on YARN-3998:
-------------------------------------

My apologies for not responding earlier. I don't think we need to create a branch. We can
continue to work off trunk as long as the default preserves the existing behaviour,
i.e. no retries.

[~hex108] - what will the complexity be of using the AM policy of retry windows for this patch?
If it's straightforward, we should do it as part of this patch; otherwise we should do it
as a follow-up.

Just to recap - what I would prefer is that we address the following points (in order) -
#  Treat restarts in a first-class manner - add the state machine changes required
#  When handling restarts - restart containers with the same local and log dirs (as long as
the disks haven't gone bad)
#  Attempt to unify the AM/Container retry policies if feasible - this can be done as part
of a follow-up JIRA because it probably requires some discussion.
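
For discussion, here is a rough sketch of what a windowed retry decision could look like, assuming the window works the way I described for the AM policy: failures older than the window stop counting towards the limit. The class name WindowedRetryPolicy, its fields and its method are hypothetical and are not taken from the patch or from any existing YARN API. With maxRetries = 0 it never retries, which keeps the existing behaviour as the default.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch only; not the YARN-3998 patch and not a Hadoop API. */
final class WindowedRetryPolicy {
    private final int maxRetries;     // 0 means never retry (the existing behaviour)
    private final long windowMillis;  // <= 0 means every failure counts forever
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    WindowedRetryPolicy(int maxRetries, long windowMillis) {
        this.maxRetries = maxRetries;
        this.windowMillis = windowMillis;
    }

    /** Records a failure at nowMillis and reports whether a relaunch is allowed. */
    boolean shouldRetry(long nowMillis) {
        if (windowMillis > 0) {
            // Forget failures that have aged out of the window.
            while (!failureTimes.isEmpty()
                    && nowMillis - failureTimes.peekFirst() > windowMillis) {
                failureTimes.pollFirst();
            }
        }
        failureTimes.addLast(nowMillis);
        // Retry only while the failures still counted do not exceed the limit.
        return failureTimes.size() <= maxRetries;
    }

    public static void main(String[] args) {
        WindowedRetryPolicy policy = new WindowedRetryPolicy(2, 1000L);
        System.out.println(policy.shouldRetry(0L));
        System.out.println(policy.shouldRetry(100L));
        System.out.println(policy.shouldRetry(200L));
    }
}
```

Whether a denied failure should still be recorded, as it is here, is one of the details that would need discussion before unifying this with the AM policy.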

> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
>                 Key: YARN-3998
>                 URL: https://issues.apache.org/jira/browse/YARN-3998
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3998.01.patch, YARN-3998.02.patch, YARN-3998.03.patch, YARN-3998.04.patch,
YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field (retry-times) to ContainerLaunchContext. When the AM launches containers,
it can specify the value. The NM will then re-launch the container up to 'retry-times' times when
it fails to run (e.g. the exit code is not 0).
> This will save a lot of time: it avoids container localization, and the RM does not need to re-schedule
the container. Local files in the container's working directory will be left for re-use. (If
the container has downloaded some big files, it does not need to re-download them when running
again.)
> We find it is useful in systems like Storm.
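
The retry-times behaviour described in the issue can be illustrated with a minimal sketch. It assumes the container launch is modelled as a function that returns an exit code; RelaunchSketch and runWithRetries are hypothetical names, not the NM's actual launch code or any field of ContainerLaunchContext. Localization and re-use of the working directory are outside what this sketch models.

```java
import java.util.function.IntSupplier;

/** Hypothetical sketch only; not the NM's launch code. */
final class RelaunchSketch {
    /**
     * Runs the launcher once, then relaunches it up to retryTimes more times
     * while the exit code is non-zero. Returns the last exit code seen.
     */
    static int runWithRetries(IntSupplier launcher, int retryTimes) {
        int exitCode = launcher.getAsInt();
        for (int attempt = 0; attempt < retryTimes && exitCode != 0; attempt++) {
            exitCode = launcher.getAsInt();
        }
        return exitCode;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in for a container that fails twice and then succeeds.
        IntSupplier flaky = () -> (++calls[0] < 3) ? 1 : 0;
        int exitCode = runWithRetries(flaky, 3);
        System.out.println("exit code " + exitCode + " after " + calls[0] + " launches");
    }
}
```

With retryTimes = 0 the launcher runs exactly once, which matches the current no-retry behaviour.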



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
