Date: Thu, 20 Jun 2013 17:56:20 +0000 (UTC)
From: "Chris Riccomini (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-864) YARN NM leaking containers

     [ https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Riccomini updated YARN-864:
---------------------------------

Description:

Hey Guys,

I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm seeing containers getting leaked by the NMs. I'm not quite sure what's going on -- has anyone seen this before? I'm concerned that maybe it's a mis-understanding on my part about how YARN's lifecycle works.

When I look in my AM logs for my app (not an MR app master), I see:

2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. This means that container container_1371141151815_0008_03_000002 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc.
2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container container_1371141151815_0008_03_000002 was assigned task ID 0. Requesting a new container for the task.

The AM has been running steadily the whole time. Here's what the NM logs say:

{noformat}
05:34:59,783 WARN AsyncDispatcher:109 - Interrupted Exception while stopping
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1143)
        at java.lang.Thread.join(Thread.java:1196)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
        at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
        at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
        at java.lang.Thread.run(Thread.java:619)
05:35:00,314 WARN ContainersMonitorImpl:463 - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
05:35:00,434 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
05:35:00,434 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_000002
05:35:00,434 WARN ContainerLaunch:247 - Failed to launch container.
java.io.IOException: java.lang.InterruptedException
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
        at org.apache.hadoop.util.Shell.run(Shell.java:129)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
05:35:00,434 WARN ContainerLaunch:247 - Failed to launch container.
java.io.IOException: java.lang.InterruptedException
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
        at org.apache.hadoop.util.Shell.run(Shell.java:129)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
{noformat}

And, if I look on the machine that's running container_1371141151815_0008_03_000002, I see:

{noformat}
$ ps -ef | grep container_1371141151815_0008_03_000002
criccomi  5365 27915 38 Jun18 ?        21:35:05 /export/apps/jdk/JDK-1_6_0_21/bin/java -cp /path-to-yarn-data-dir/usercache/criccomi/appcache/application_1371141151815_0008/container_1371141151815_0008_03_000002/...
{noformat}

The same holds true for container_1371141151815_0006_01_001598. When I look in the container logs, it's just happily running. No kill signal appears to be sent, and no error appears.

Lastly, the RM logs show no major events around the time of the leak (5:35am). I am able to reproduce this simply by waiting about 12 hours, or so, and it seems to have started happening after I switched over to CGroups and LCE, and turned on stateful RM (using file system).

Any ideas what's going on?

Thanks!
Chris
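For reference, the -100 in the AM log above is ContainerExitStatus.ABORTED, which the RM reports when a container was released by the AM or lost along with its node. The following is a minimal, hypothetical sketch (not the actual AppMasterTaskManager code from this report) of how a custom, non-MR AM built on the AMRMClientAsync API typically reacts to that status by requesting a replacement container; package and class names are as in later Hadoop 2.x client libraries, and the resource/priority values are assumptions.

{noformat}
// Hypothetical sketch only -- not the AppMasterTaskManager code from this report.
// Shows a custom AM reacting to exit status -100 (ContainerExitStatus.ABORTED)
// by asking the RM for a replacement container.
import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

class TaskContainerCallback implements AMRMClientAsync.CallbackHandler {
  private final AMRMClientAsync<ContainerRequest> amClient; // wired in by the AM

  TaskContainerCallback(AMRMClientAsync<ContainerRequest> amClient) {
    this.amClient = amClient;
  }

  @Override
  public void onContainersCompleted(List<ContainerStatus> statuses) {
    for (ContainerStatus status : statuses) {
      if (status.getExitStatus() == ContainerExitStatus.ABORTED) {
        // -100: released by the AM or "lost" (e.g. node/NM failure), as in the
        // AM log above. Ask the RM for a new container to rerun the task.
        ContainerRequest replacement = new ContainerRequest(
            Resource.newInstance(1024, 1),      // assumed memory/vcores ask
            null, null,                         // no node/rack constraints
            Priority.newInstance(0));
        amClient.addContainerRequest(replacement);
      }
    }
  }

  @Override public void onContainersAllocated(List<Container> containers) { /* launch tasks */ }
  @Override public void onShutdownRequest() { /* stop the AM */ }
  @Override public void onNodesUpdated(List<NodeReport> updatedNodes) { }
  @Override public void onError(Throwable e) { /* fail the AM */ }
  @Override public float getProgress() { return 0.0f; }
}
{noformat}

In a real AM the handler would be passed to AMRMClientAsync.createAMRMClientAsync(heartbeatIntervalMs, handler) and given the client reference once the client is started; none of that changes the gist above.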
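The ps check above also generalizes into a quick per-node test for leaked containers: if a per-container cgroup directory that the NM failed to delete still has live PIDs attached, the corresponding process is running outside YARN's control. A rough sketch under those assumptions (hypothetical class, not part of YARN; assumes the cgroup v1 cpu hierarchy is mounted at /cgroup/cpu as in the CgroupsLCEResourcesHandler warnings above):

{noformat}
// Rough diagnostic sketch (class name and output format are made up for illustration).
// Walks /cgroup/cpu/hadoop-yarn and reports container cgroups that still list
// PIDs in their "tasks" pseudo-file, i.e. processes YARN has lost track of.
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class LeakedContainerCheck {
  public static void main(String[] args) throws IOException {
    Path yarnCgroup = Paths.get("/cgroup/cpu/hadoop-yarn");
    try (DirectoryStream<Path> containers =
             Files.newDirectoryStream(yarnCgroup, "container_*")) {
      for (Path cgroupDir : containers) {
        // The cgroup "tasks" file lists the PIDs still attached to this cgroup.
        List<String> pids = Files.readAllLines(cgroupDir.resolve("tasks"));
        if (!pids.isEmpty()) {
          System.out.println("possibly leaked: " + cgroupDir.getFileName()
              + " pids=" + pids);
        }
      }
    }
  }
}
{noformat}

Cross-checking the printed container IDs against what the NM still reports (e.g. its web UI, default port 8042) shows which of them the NM has forgotten about.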
> YARN NM leaking containers
> --------------------------
>
>                 Key: YARN-864
>                 URL: https://issues.apache.org/jira/browse/YARN-864
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.0.5-alpha
>         Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and YARN-600.
>            Reporter: Chris Riccomini