hadoop-yarn-issues mailing list archives

From "Jun Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
Date Wed, 03 Feb 2016 09:36:40 GMT

    [ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130109#comment-15130109 ]

Jun Gong commented on YARN-3998:
--------------------------------

[~vvasudev], I just attached a new patch to address the problems above. Thanks for the review.

1) When looking for a container's previous working directory and log directory, only locate the
corresponding files in good directories, i.e. directories that are readable, writable and not full.

2) Limit the diagnostic message to 10000 bytes. If it is longer than that,
delete its first line (lines are separated by "\n").

3) After some container retries, the env variable *MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX* (DEFAULT_NM_ADMIN_USER_ENV)
is expanded to *MALLOC_ARENA_MAX=::::::::::::::::::::::* (a lot of ":"). I fixed it in
*Apps#addToEnvironment*.
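
For illustration only, the trimming described in point 2 could look roughly like the sketch below. It is not the code in the patch. The class and method names are hypothetical, the limit is counted in characters rather than bytes for simplicity, and two behaviours are assumptions of the sketch: the first line is dropped repeatedly until the message fits, and a message with no "\n" left is cut down to its tail.

```java
// Hypothetical helper, not taken from the YARN-3998 patch.
class DiagnosticsTrimmer {
    // Limit from point 2; the constant name is made up for this sketch.
    static final int MAX_DIAGNOSTICS_LENGTH = 10000;

    // Drops leading lines until the message is no longer than maxLength.
    // Assumption: length is measured in chars, not bytes.
    static String trim(String diagnostics, int maxLength) {
        StringBuilder sb = new StringBuilder(diagnostics);
        while (sb.length() > maxLength) {
            int newline = sb.indexOf("\n");
            if (newline < 0) {
                // Assumption: no separator left, so keep only the tail.
                sb.delete(0, sb.length() - maxLength);
                break;
            }
            // Remove the first line together with its "\n" separator.
            sb.delete(0, newline + 1);
        }
        return sb.toString();
    }
}
```

Dropping the oldest lines first keeps the most recent diagnostics, which are usually the ones that explain the latest failed attempt.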

> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
>                 Key: YARN-3998
>                 URL: https://issues.apache.org/jira/browse/YARN-3998
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3998.01.patch, YARN-3998.02.patch, YARN-3998.03.patch, YARN-3998.04.patch,
YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field (retry-times) to ContainerLaunchContext. When the AM launches containers,
it can specify the value. The NM will then re-launch the container 'retry-times' times when
it fails to run (e.g. the exit code is not 0).
> This saves a lot of time: it avoids container localization, and the RM does not need to re-schedule
the container. Local files in the container's working directory are also left for re-use. (If the
container has downloaded some big files, it does not need to re-download them when running
again.)
> We find it useful in systems like Storm.
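
The retry behaviour proposed in the description can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not NodeManager code: `RetryingLauncher`, `launchWithRetries` and the `IntSupplier` standing in for "run the container and return its exit code" are all hypothetical names, and the sketch assumes 'retry-times' counts re-launches after the first attempt.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of the proposal, not the NodeManager implementation.
class RetryingLauncher {
    // Runs launchAttempt once, then re-runs it while it exits non-zero
    // and retries remain. Returns the exit code of the last attempt.
    static int launchWithRetries(IntSupplier launchAttempt, int retryTimes) {
        int exitCode = launchAttempt.getAsInt();
        int retriesLeft = retryTimes;
        while (exitCode != 0 && retriesLeft > 0) {
            retriesLeft--;
            // Same attempt again: in the proposal the working directory and
            // localized files are reused, so nothing is downloaded again.
            exitCode = launchAttempt.getAsInt();
        }
        return exitCode;
    }
}
```

With `retryTimes` set to 0 the loop body never runs, which matches the behaviour without the new field: one attempt and no re-launch.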



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
