hadoop-yarn-issues mailing list archives

From "Jun Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
Date Tue, 02 Feb 2016 08:59:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127929#comment-15127929 ]

Jun Gong commented on YARN-3998:
--------------------------------

Thanks [~vvasudev] for the detailed review and comments. I will update the patch.

{quote}
9) In ContainerLaunch.java, in getContainerWorkDir, we should use dirsHandler.getLocalPathForWrite
instead of dirsHandler.getLocalPathForRead.
{quote}
*getLocalPathForWrite* is used for allocating a new directory. I only want to search for the
container's token file in all log paths in *getContainerLogDir*, so I use *getLocalPathForRead*.
If the token file is not found, a new directory will be allocated by *getLocalPathForWrite*.
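
The read-then-write fallback described above can be sketched roughly as follows. This is a minimal illustration using only java.nio.file, not the NodeManager code: the class ContainerDirLookup, its methods, and the marker file name passed in are all hypothetical stand-ins for the *getLocalPathForRead* / *getLocalPathForWrite* pair.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for the lookup described above; not the NodeManager implementation.
final class ContainerDirLookup {

    // Read-style lookup: return the first candidate directory that already holds the marker file.
    static Optional<Path> findExisting(List<Path> candidateDirs, String markerFile) {
        for (Path dir : candidateDirs) {
            if (Files.isRegularFile(dir.resolve(markerFile))) {
                return Optional.of(dir);
            }
        }
        return Optional.empty();
    }

    // Write-style allocation: create and return the first candidate directory that can be created.
    static Path allocate(List<Path> candidateDirs) throws IOException {
        IOException last = null;
        for (Path dir : candidateDirs) {
            try {
                return Files.createDirectories(dir);
            } catch (IOException e) {
                last = e; // remember the failure and try the next candidate
            }
        }
        throw last != null ? last : new IOException("no candidate directories");
    }

    // Prefer a directory left over from an earlier run; allocate a new one only if none is found.
    static Path resolve(List<Path> candidateDirs, String markerFile) throws IOException {
        Optional<Path> existing = findExisting(candidateDirs, markerFile);
        return existing.isPresent() ? existing.get() : allocate(candidateDirs);
    }
}
```

Searching first means a re-launched container can land back in the directory it used before, which is the point of using the read-style call here.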

{quote}
A couple of additional thoughts: I would like to restart containers for kill and term signals
that don't originate from YARN, and to handle scenarios where disks have gone bad and resources
need to be relocalized. Both of those cases can be handled in follow-up JIRAs.
{quote}
OK. I will create another JIRA to handle these cases.

> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
>                 Key: YARN-3998
>                 URL: https://issues.apache.org/jira/browse/YARN-3998
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3998.01.patch, YARN-3998.02.patch, YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch
>
>
> I'd like to add a field (retry-times) to ContainerLaunchContext. When the AM launches containers, it could specify the value. The NM will then re-launch the container 'retry-times' times when it fails to run (e.g. the exit code is not 0).
> This will save a lot of time. It avoids container localization, and the RM does not need to re-schedule the container. Local files in the container's working directory will also be left for re-use. (If the container has downloaded some big files, it does not need to re-download them when running again.)
> We find it is useful in systems like Storm.
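
As a rough illustration of the retry behaviour proposed in the issue description, here is a minimal sketch. It assumes that 'retry-times' means the number of additional launches allowed after the first failure and that any non-zero exit code counts as a failure. The class RetryingLauncher and its method are hypothetical and are not part of the patch.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of the proposed retry behaviour; not the NodeManager implementation.
final class RetryingLauncher {

    // Runs `launch` once, then re-runs it while it returns a non-zero exit code
    // and retries remain. Returns the last exit code observed.
    static int launchWithRetries(IntSupplier launch, int retryTimes) {
        if (retryTimes < 0) {
            throw new IllegalArgumentException("retryTimes must be >= 0");
        }
        int exitCode = launch.getAsInt();
        int retriesLeft = retryTimes;
        while (exitCode != 0 && retriesLeft > 0) {
            retriesLeft--;
            // Re-launch in place: nothing is re-localized or re-scheduled in this sketch.
            exitCode = launch.getAsInt();
        }
        return exitCode;
    }
}
```

With retryTimes set to 0 the launch runs exactly once, which matches the behaviour without the new field.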



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
