hadoop-yarn-issues mailing list archives

From "Shane Kumpf (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-7278) LinuxContainer in docker mode will be failed when nodemanager restart, because timeout for docker is too slow.
Date Thu, 29 Mar 2018 12:56:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-7278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418944#comment-16418944 ]

Shane Kumpf commented on YARN-7278:

I believe this is now resolved with the changes added by YARN-5366. We no longer call
{{docker rm}} prior to writing the exit code and no longer depend on {{docker wait}}.
Closing this for now, but please reopen if you see this after applying that patch.

> LinuxContainer in docker mode will be failed when nodemanager restart, because timeout for docker is too slow.
> --------------------------------------------------------------------------------------------------------------
>                 Key: YARN-7278
>                 URL: https://issues.apache.org/jira/browse/YARN-7278
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.0
>         Environment: CentOS
>            Reporter: zhengchenyu
>            Priority: Major
>             Fix For: 2.9.1
>   Original Estimate: 1m
>  Remaining Estimate: 1m
> In our cluster, nodemanager recovery is turned on, and we use LinuxContainer with docker.
> A container may fail when the nodemanager restarts; the exception is below:
> {code}
> [2017-09-29T15:47:14.433+08:00] [INFO] containermanager.monitor.ContainersMonitorImpl.run(ContainersMonitorImpl.java:472) [Container Monitor] : Memory usage of ProcessTree 120523 for container-id container_1506600355508_0023_01_000004: -1B of 10 GB physical memory used; -1B of 31 GB virtual memory used
> [2017-09-29T15:47:15.219+08:00] [ERROR] containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:93) [ContainersLauncher #1] : Unable to recover container container_1506600355508_0023_01_000004
> java.io.IOException: Timeout while waiting for exit code from container_1506600355508_0023_01_000004
> [2017-09-29T15:47:15.220+08:00] [INFO] containermanager.container.ContainerImpl.handle(ContainerImpl.java:1142) [AsyncDispatcher event handler] : Container container_1506600355508_0023_01_000004 transitioned
> [2017-09-29T15:47:15.221+08:00] [INFO] containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:440) [AsyncDispatcher event handler] : Cleaning up container container_1506600355508_0023_01_000004
> {code}
> I guess the process had finished, but 2 seconds later (the variable is msecLeft) the
> *.pid.exitcode file still had not been created. I then changed the variable to 20000 ms,
> and the container recovered successfully when the nodemanager restarted.
> So I think the timeout is too short for the docker container to complete its work.
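
The wait described above can be sketched as a poll loop with a deadline. This is only an illustration in Python, not the NodeManager's actual code: the function name `wait_for_exit_code`, its parameters, and the polling interval are assumptions made for this example. Only the 2000 ms default, the 20000 ms alternative, the `*.pid.exitcode` file, and the error message come from the report.

```python
import time


def wait_for_exit_code(exit_code_path, timeout_ms=2000, poll_ms=100):
    """Poll for an exit-code file until it has content or the deadline passes.

    Hypothetical helper: timeout_ms plays the role of msecLeft in the report.
    """
    deadline = time.monotonic() + timeout_ms / 1000.0
    while True:
        try:
            with open(exit_code_path) as f:
                text = f.read().strip()
            if text:
                # The file exists and has been written: report the exit code.
                return int(text)
        except FileNotFoundError:
            # The file has not been created yet; keep waiting.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "Timeout while waiting for exit code from " + exit_code_path)
        time.sleep(poll_ms / 1000.0)
```

With the default of 2000 ms, a file that only appears after a slow cleanup step is missed; calling the helper with `timeout_ms=20000` corresponds to the workaround the reporter tried.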
> In docker mode of LinuxContainer, the nm monitors the real task, which is launched by the
> "docker run" command. The "docker wait" command then waits for the exit code, and
> "docker rm" deletes the docker container. Lastly, container-executor writes the exit code.
> So if any docker command is slow enough, the nm will not be able to monitor the container.
> In fact, docker rm is always slow.

> I think the exit code of docker rm has nothing to do with the real task, so I think we
> could write "*.pid.exitcode" before running docker rm. Alternatively, we could monitor
> the docker wait process rather than the real task.
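
The reordering proposed above can be illustrated with a small Python sketch. It is not the container-executor implementation: `finish_container` and its callable parameters are hypothetical stand-ins for "docker wait", "docker rm", and the step that writes the exit-code file, and they only show how the order of the steps determines when the exit code becomes visible.

```python
def finish_container(wait_fn, rm_fn, write_exit_code_fn, write_before_rm=True):
    """Run the post-launch steps in one of the two orders discussed above.

    wait_fn            -- stand-in for "docker wait"; returns the exit code
    rm_fn              -- stand-in for "docker rm"; may be slow
    write_exit_code_fn -- stand-in for writing the *.pid.exitcode file
    """
    exit_code = wait_fn()
    if write_before_rm:
        # Proposed order: publish the exit code first, so a slow removal
        # cannot delay whoever is waiting for the file.
        write_exit_code_fn(exit_code)
        rm_fn()
    else:
        # Order described in the report: the exit code is written only
        # after the removal has finished.
        rm_fn()
        write_exit_code_fn(exit_code)
    return exit_code


if __name__ == "__main__":
    events = []
    finish_container(
        wait_fn=lambda: 0,
        rm_fn=lambda: events.append("rm"),
        write_exit_code_fn=lambda code: events.append("write %d" % code),
    )
    print(events)  # ['write 0', 'rm']
```

In the proposed order the time spent in the removal step no longer counts against the timeout for the exit-code file.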

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org
