hadoop-yarn-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
Date Fri, 14 Nov 2014 14:19:35 GMT

    [ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212290#comment-14212290 ]

Hudson commented on YARN-2846:
------------------------------

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/5/])
YARN-2846. Incorrect persist exit code for running containers in reacquireContainer() that
interrupted by NodeManager restart. Contributed by Junping Du (jlowe: rev 33ea5ae92b9dd3abace104903d9a94d17dd75af5)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java


> Incorrect persist exit code for running containers in reacquireContainer() that interrupted
by NodeManager restart.
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-2846
>                 URL: https://issues.apache.org/jira/browse/YARN-2846
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Blocker
>             Fix For: 2.6.0
>
>         Attachments: YARN-2846-demo.patch, YARN-2846.patch
>
>
> The work-preserving NM restart feature can cause a running AM container to be marked LOST and
killed when the NM daemon is stopped. The log and exception look like this:
> {code}
> 2014-11-11 00:48:35,214 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for container-id container_1415666714233_0001_01_000084: 53.8 MB of 512 MB physical memory used; 931.3 MB of 1.0 GB virtual memory used
> 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM
> 2014-11-11 00:48:35,299 INFO  mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060
> 2014-11-11 00:48:35,337 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - Applications still running : [application_1415666714233_0001]
> 2014-11-11 00:48:35,338 INFO  ipc.Server (Server.java:stop(2437)) - Stopping server on 45454
> 2014-11-11 00:48:35,344 INFO  ipc.Server (Server.java:run(706)) - Stopping IPC Server listener on 45454
> 2014-11-11 00:48:35,346 INFO  logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
> 2014-11-11 00:48:35,347 INFO  ipc.Server (Server.java:run(832)) - Stopping IPC Server Responder
> 2014-11-11 00:48:35,347 INFO  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log aggregation for application_1415666714233_0001
> 2014-11-11 00:48:35,348 WARN  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for application application_1415666714233_0001
> 2014-11-11 00:48:35,358 WARN  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(476)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
> 2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(87)) - Unable to recover container container_1415666714233_0001_01_000001
> java.io.IOException: Interrupted while waiting for process 20001 to exit
>         at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException: sleep interrupted
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177)
>         ... 6 more
> {code}
> In reacquireContainer() of ContainerExecutor.java, the while loop that polls the container
process (here the AM container) is interrupted when the NM stops. An IOException is thrown, and
no ExitCodeFile is generated for the still-running container. The IOException is then caught
in the caller (RecoveredContainerLaunch.call()), and the exit code (which defaults to LOST when
nothing has set it) is persisted in the NMStateStore.
> After the NM restarts, this container is recovered in the COMPLETE state but with exit code
LOST (154), which causes this (AM) container to be killed later.
> We should not record an exit code for a running container when we detect that the wait on
its process was interrupted.
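
The behaviour proposed in the description can be sketched as below. This is only an illustration under stated assumptions, not the committed patch: the class RecoverySketch, the method recover(), and the in-memory map standing in for the NMStateStore are all hypothetical names, and the value 154 for LOST is taken from the report above. The idea is to treat an interrupted wait differently from a genuine failure, and to persist an exit code only in the latter case.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical sketch; none of these names come from the Hadoop code base.
class RecoverySketch {
    // Exit code quoted in the report for a LOST container.
    static final int LOST = 154;

    // Stand-in for the NM state store: container id -> persisted exit code.
    static final Map<String, Integer> store = new HashMap<>();

    /**
     * Waits for a recovered container to exit and persists its exit code,
     * unless the wait itself was interrupted (for example by NM shutdown).
     * Returns true if an exit code was persisted.
     */
    static boolean recover(String containerId, Callable<Integer> waitForExit) {
        int exitCode = LOST; // default when the real exit code cannot be determined
        boolean interrupted = false;
        try {
            exitCode = waitForExit.call();
        } catch (InterruptedException e) {
            // The container may still be running; do not guess an exit code.
            interrupted = true;
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
        } catch (Exception e) {
            // Genuine failure to reacquire: exitCode stays LOST and is recorded.
        }
        if (interrupted) {
            return false; // nothing persisted, so a later restart can retry recovery
        }
        store.put(containerId, exitCode);
        return true;
    }
}
```

With this shape, a call such as RecoverySketch.recover(id, waiter) leaves the store untouched when the waiter throws InterruptedException, so the container would not be recovered as COMPLETE with exit code LOST after the next restart.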



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
