hadoop-yarn-issues mailing list archives

From "Chris Riccomini (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-864) YARN NM leaking containers with CGroups
Date Mon, 24 Jun 2013 19:38:21 GMT

    [ https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692296#comment-13692296 ]

Chris Riccomini commented on YARN-864:
--------------------------------------

Hey Jian,

With your patch applied, the new error (in the NM) is:

{noformat}
19:33:36,741  INFO NodeStatusUpdaterImpl:365 - Node is out of sync with ResourceManager, hence rebooting.
19:33:36,764  INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 14751 for container-id container_1372091455469_0002_01_000002: 779.3 MB of 1.3 GB physical memory used; 1.6 GB of 10 GB virtual memory used
19:33:37,239  INFO NodeManager:315 - Rebooting the node manager.
19:33:37,261  INFO NodeManager:229 - Containers still running on shutdown: [container_1372091455469_0002_01_000002]
19:33:37,278 FATAL AsyncDispatcher:137 - Error in dispatcher thread
org.apache.hadoop.metrics2.MetricsException: Metrics source JvmMetrics already exists!
	at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:126)
	at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:107)
	at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:217)
	at org.apache.hadoop.metrics2.source.JvmMetrics.create(JvmMetrics.java:79)
	at org.apache.hadoop.yarn.server.nodemanager.metrics.NodeManagerMetrics.create(NodeManagerMetrics.java:49)
	at org.apache.hadoop.yarn.server.nodemanager.metrics.NodeManagerMetrics.create(NodeManagerMetrics.java:45)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.<init>(NodeManager.java:75)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.createNewNodeManager(NodeManager.java:357)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.reboot(NodeManager.java:316)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:348)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
	at java.lang.Thread.run(Thread.java:619)
{noformat}
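
Looking at the trace, the FATAL comes from the in-process reboot tripping over the JVM-wide metrics registry: NodeManager.reboot() constructs a second NodeManager in the same JVM, and the new instance's NodeManagerMetrics.create() tries to register JvmMetrics with DefaultMetricsSystem a second time. A minimal sketch that should reproduce the same exception (hypothetical class name; assumes hadoop-common on the classpath):

{noformat}
import org.apache.hadoop.metrics2.MetricsSystem;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.source.JvmMetrics;

public class DuplicateJvmMetrics {
  public static void main(String[] args) {
    MetricsSystem ms = DefaultMetricsSystem.initialize("NodeManager");
    JvmMetrics.create("NodeManager", null, ms);  // first NodeManager: fine
    // Simulates reboot() constructing a second NodeManager without first
    // shutting the metrics system down:
    JvmMetrics.create("NodeManager", null, ms);  // "JvmMetrics already exists!"
  }
}
{noformat}

Presumably the reboot path needs to shut down (or reuse) the metrics system before constructing the new NodeManager.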

For the record, you can reproduce this yourself by:

1. Start a YARN RM and NM.
2. Run a YARN job on the cluster that uses at least one container.
3. Run kill -STOP <NM PID> on the NM.
4. Wait 65 seconds (enough for the NM to time out).
5. Run kill -CONT <NM PID>.

You will see the NM trigger a reboot since it's out of sync with the RM.
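
One note on step 4: with stock settings the RM-side NM expiry (yarn.nm.liveness-monitor.expiry-interval-ms in 2.x yarn-default.xml) defaults to 600000 ms, i.e. 10 minutes, so the 65-second wait assumes the interval was lowered for testing, e.g. (sketch only; verify the property name against your build):

{noformat}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LowerNmExpiry {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Default is 600000 ms (10 min); 60000 ms makes a 65s wait sufficient.
    conf.setLong("yarn.nm.liveness-monitor.expiry-interval-ms", 60000L);
    System.out.println(conf.get("yarn.nm.liveness-monitor.expiry-interval-ms"));
  }
}
{noformat}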
                
> YARN NM leaking containers with CGroups
> ---------------------------------------
>
>                 Key: YARN-864
>                 URL: https://issues.apache.org/jira/browse/YARN-864
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.0.5-alpha
>         Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and YARN-600.
>            Reporter: Chris Riccomini
>         Attachments: rm-log, YARN-864.1.patch
>
>
> Hey Guys,
> I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm seeing containers getting leaked by the NMs. I'm not quite sure what's going on -- has anyone seen this before? I'm concerned that maybe it's a misunderstanding on my part about how YARN's lifecycle works.
> When I look in my AM logs for my app (not an MR app master), I see:
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. This means that container container_1371141151815_0008_03_000002 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc.
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container container_1371141151815_0008_03_000002 was assigned task ID 0. Requesting a new container for the task.
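> For reference on the -100: that is ContainerExitStatus.ABORTED in current YARN APIs, i.e. YARN (not the task) ended the container. A hedged sketch of the AM-side handling described above (the class name and resubmitTask are illustrative, not from any real AM):
> {noformat}
> import java.util.List;
> import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
> import org.apache.hadoop.yarn.api.records.ContainerId;
> import org.apache.hadoop.yarn.api.records.ContainerStatus;
>
> abstract class AbortedContainerHandling {
>   // Hypothetical hook: request a replacement container for the lost task.
>   abstract void resubmitTask(ContainerId id);
>
>   void onCompleted(List<ContainerStatus> statuses) {
>     for (ContainerStatus s : statuses) {
>       // -100 == ContainerExitStatus.ABORTED: the container was released
>       // or lost (e.g. NM declared dead), not a failure of the task itself.
>       if (s.getExitStatus() == ContainerExitStatus.ABORTED) {
>         resubmitTask(s.getContainerId());
>       }
>     }
>   }
> }
> {noformat}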
> The AM has been running steadily the whole time. Here's what the NM logs say:
> {noformat}
> 05:34:59,783  WARN AsyncDispatcher:109 - Interrupted Exception while stopping
> java.lang.InterruptedException
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Thread.join(Thread.java:1143)
>         at java.lang.Thread.join(Thread.java:1196)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
>         at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
>         at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
>         at java.lang.Thread.run(Thread.java:619)
> 05:35:00,314  WARN ContainersMonitorImpl:463 - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
> 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
> 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_000002
> 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
>         at org.apache.hadoop.util.Shell.run(Shell.java:129)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
>         at org.apache.hadoop.util.Shell.run(Shell.java:129)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> {noformat}
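> The "Unable to delete cgroup" warnings fit the leak: the kernel refuses to remove a cgroup directory while its tasks file is non-empty, so the delete fails exactly when the container process is still alive. A small probe sketch (hypothetical class; cgroup v1 mount path taken from the warnings above):
> {noformat}
> import java.io.File;
> import java.nio.charset.StandardCharsets;
> import java.nio.file.Files;
> import java.nio.file.Paths;
> import java.util.List;
>
> public class CgroupProbe {
>   public static void main(String[] args) throws Exception {
>     String cg = "/cgroup/cpu/hadoop-yarn/" + args[0];  // container id
>     // A non-empty "tasks" file means live processes are still attached.
>     List<String> tasks =
>         Files.readAllLines(Paths.get(cg, "tasks"), StandardCharsets.UTF_8);
>     System.out.println(tasks.size() + " task(s) still in " + cg);
>     // rmdir(2) fails with EBUSY until "tasks" is empty, so delete()
>     // returning false is expected while the container is running.
>     System.out.println("delete() returned " + new File(cg).delete());
>   }
> }
> {noformat}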
> And, if I look on the machine that's running container_1371141151815_0008_03_000002, I see:
> {noformat}
> $ ps -ef | grep container_1371141151815_0008_03_000002
> criccomi  5365 27915 38 Jun18 ?        21:35:05 /export/apps/jdk/JDK-1_6_0_21/bin/java -cp /path-to-yarn-data-dir/usercache/criccomi/appcache/application_1371141151815_0008/container_1371141151815_0008_03_000002/...
> {noformat}
> The same holds true for container_1371141151815_0006_01_001598. When I look in the container logs, it's just happily running. No kill signal appears to have been sent, and no error appears.
> Lastly, the RM logs show no major events around the time of the leak (5:35 AM). I am able to reproduce this simply by waiting about 12 hours or so, and it seems to have started happening after I switched over to CGroups and the LCE, and turned on the stateful RM (using the file system store).
> Any ideas what's going on?
> Thanks!
> Chris

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
