hadoop-yarn-issues mailing list archives

From "Varun Vasudev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
Date Wed, 09 Dec 2015 17:45:11 GMT

    [ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049050#comment-15049050 ]

Varun Vasudev commented on YARN-3998:
-------------------------------------

I'm thinking of re-opening this issue because we've seen a use case for it: long-running
services which (ideally) shouldn't lose local data if the service crashes.

Some pros of supporting some form of restart policy on the NM:

1. Retry policies can be unified instead of every application having to re-implement their
own.

2. Faster restarts - instead of the NM reaching out to the AM and waiting for it to decide
what to do (while maintaining the container work dir), the NM can make an immediate decision.
It's also an easier change to make: if the NMs need to talk to the AMs to decide whether to
restart a container, we'll probably need a new state transition. If we instead allow the AMs
to specify a restart policy up front, the NM can act as soon as the container exits.

3. Similar to what Jun mentioned - when running Docker containers, it's useful to be able
to restart containers that exit with an error code.

When I say restart policies, off the top of my head I can think of 3: never restart (the
default), restart on all errors, and restart on specific error codes.
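
To make the three policies concrete, here is a minimal Java sketch of the decision the NM
could make when a container exits. Everything in it is hypothetical: the class
RestartPolicySketch, the Policy enum, the maxRetries cap (borrowed from the "retry-times"
idea in the issue description) and the treatment of exit code 0 as success are assumptions
for illustration, not existing YARN APIs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch only; none of these names exist in YARN. */
final class RestartPolicySketch {

    /** The three policies mentioned above. */
    enum Policy { NEVER, ON_ALL_ERRORS, ON_SPECIFIC_ERRORS }

    private final Policy policy;
    private final Set<Integer> retryableExitCodes; // consulted only for ON_SPECIFIC_ERRORS
    private final int maxRetries;                  // assumed cap, cf. "retry-times"

    RestartPolicySketch(Policy policy, Set<Integer> retryableExitCodes, int maxRetries) {
        this.policy = policy;
        this.retryableExitCodes = new HashSet<>(retryableExitCodes);
        this.maxRetries = maxRetries;
    }

    /**
     * Decide locally, without asking the AM, whether to relaunch.
     * retriesSoFar is the number of relaunches already performed.
     */
    boolean shouldRestart(int exitCode, int retriesSoFar) {
        // Assumption: exit code 0 means success and is never retried.
        if (exitCode == 0 || retriesSoFar >= maxRetries) {
            return false;
        }
        switch (policy) {
            case ON_ALL_ERRORS:
                return true;
            case ON_SPECIFIC_ERRORS:
                return retryableExitCodes.contains(exitCode);
            default: // NEVER
                return false;
        }
    }

    public static void main(String[] args) {
        RestartPolicySketch p = new RestartPolicySketch(
                Policy.ON_SPECIFIC_ERRORS, new HashSet<>(Arrays.asList(1, 137)), 2);
        System.out.println(p.shouldRestart(137, 0)); // true: listed code, retries left
        System.out.println(p.shouldRestart(2, 0));   // false: code not listed
        System.out.println(p.shouldRestart(137, 2)); // false: retries exhausted

        RestartPolicySketch never = new RestartPolicySketch(
                Policy.NEVER, Collections.<Integer>emptySet(), 2);
        System.out.println(never.shouldRestart(1, 0)); // false: default policy
    }
}
```

Because the policy and its parameters would be fixed at launch time, the check needs no
round trip to the AM, which is the point of pro 2 above.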

[~jlowe], [~steve_l] - do you guys still feel that this should be done at the app level (and
essentially re-implemented by every app)?

> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
>                 Key: YARN-3998
>                 URL: https://issues.apache.org/jira/browse/YARN-3998
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>
> I'd like to add a field (retry-times) in ContainerLaunchContext. When the AM launches
> containers, it could specify the value. Then the NM will re-launch the container
> 'retry-times' times when it fails to run (e.g. exit code is not 0).
> It will save a lot of time. It avoids container localization. The RM does not need to
> re-schedule the container. And local files in the container's working directory will be
> left for re-use. (If the container has downloaded some big files, it does not need to
> re-download them when running again.)
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
