Date: Fri, 21 Jun 2013 22:02:21 +0000 (UTC)
From: "Chris Riccomini (JIRA)"
To: yarn-issues@hadoop.apache.org
Reply-To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-864) YARN NM leaking containers with CGroups

    [ https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13690784#comment-13690784 ]

Chris Riccomini commented on YARN-864:
--------------------------------------

Hey Jian,

I'm testing YARN 2.0.5-alpha with YARN-799, YARN-600, and YARN-688. I'm going to let it run over the weekend. Failures normally start happening within 5-6 hours. I'll keep you posted.

Cheers,
Chris

> YARN NM leaking containers with CGroups
> ---------------------------------------
>
>             Key: YARN-864
>             URL: https://issues.apache.org/jira/browse/YARN-864
>         Project: Hadoop YARN
>      Issue Type: Bug
>      Components: nodemanager
> Affects Versions: 2.0.5-alpha
>     Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and YARN-600.
>        Reporter: Chris Riccomini
>     Attachments: rm-log
>
>
> Hey Guys,
> I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm seeing containers getting leaked by the NMs. I'm not quite sure what's going on -- has anyone seen this before? I'm concerned that maybe it's a misunderstanding on my part about how YARN's container lifecycle works.
> When I look in my AM logs for my app (not an MR app master), I see:
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. This means that container container_1371141151815_0008_03_000002 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc.
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container container_1371141151815_0008_03_000002 was assigned task ID 0. Requesting a new container for the task.
> The AM has been running steadily the whole time.
> Here's what the NM logs say:
> {noformat}
> 05:34:59,783 WARN AsyncDispatcher:109 - Interrupted Exception while stopping
> java.lang.InterruptedException
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Thread.join(Thread.java:1143)
>         at java.lang.Thread.join(Thread.java:1196)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
>         at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
>         at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
>         at java.lang.Thread.run(Thread.java:619)
> 05:35:00,314 WARN ContainersMonitorImpl:463 - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
> 05:35:00,434 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
> 05:35:00,434 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_000002
> 05:35:00,434 WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
>         at org.apache.hadoop.util.Shell.run(Shell.java:129)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> 05:35:00,434 WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
>         at org.apache.hadoop.util.Shell.run(Shell.java:129)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> {noformat}
> And, if I look on the machine that's running container_1371141151815_0008_03_000002, I see:
> {noformat}
> $ ps -ef | grep container_1371141151815_0008_03_000002
> criccomi 5365 27915 38 Jun18 ? 21:35:05 /export/apps/jdk/JDK-1_6_0_21/bin/java -cp /path-to-yarn-data-dir/usercache/criccomi/appcache/application_1371141151815_0008/container_1371141151815_0008_03_000002/...
> {noformat}
> The same holds true for container_1371141151815_0006_01_001598. When I look in the container logs, it's just happily running. No kill signal appears to be sent, and no error appears.
> Lastly, the RM logs show no major events around the time of the leak (5:35am). I am able to reproduce this simply by waiting about 12 hours or so, and it seems to have started happening after I switched over to CGroups and LCE, and turned on stateful RM (using file system).
> Any ideas what's going on?
> Thanks!
> Chris
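
For context on the -100 in the AppMasterTaskManager lines quoted above: it is a status the RM reports for containers that were released by the AM or lost along with their node, not an exit code produced by the container process itself (in later 2.x releases it is exposed as ContainerExitStatus.ABORTED). A minimal sketch of how an AM might classify completed containers this way, assuming the ContainerStatus and ContainerId records from hadoop-yarn-api; the requestReplacementContainer and recordTaskExit helpers are hypothetical stand-ins for the AM's own bookkeeping:

{noformat}
import java.util.List;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

public class CompletedContainerHandler {

  // -100: the container was released by the AM or lost due to a node/NM
  // failure, so this value does not come from the container process.
  // Later 2.x releases name this constant ContainerExitStatus.ABORTED.
  private static final int ABORTED_EXIT_STATUS = -100;

  void onContainersCompleted(List<ContainerStatus> statuses) {
    for (ContainerStatus status : statuses) {
      ContainerId id = status.getContainerId();
      if (status.getExitStatus() == ABORTED_EXIT_STATUS) {
        // Treat the task as lost and ask the RM for a replacement container,
        // which is what the AM log above shows happening for task ID 0.
        requestReplacementContainer(id);
      } else {
        // Any other value is the real exit code of the container process.
        recordTaskExit(id, status.getExitStatus());
      }
    }
  }

  // Hypothetical helpers standing in for the AM's task bookkeeping.
  void requestReplacementContainer(ContainerId id) { }

  void recordTaskExit(ContainerId id, int exitStatus) { }
}
{noformat}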
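The "Unable to delete cgroup at" warnings quoted above typically mean the container's cgroup still has live processes attached: the kernel refuses to remove a cgroup directory while its tasks file is non-empty, which lines up with the container JVMs later being found still running by ps. A small stand-alone check for that condition, assuming the /cgroup/cpu/hadoop-yarn hierarchy shown in the NM log (adjust the path to the local cgroup mount):

{noformat}
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CgroupTaskCheck {

  // Hierarchy taken from the NM log above; adjust to the local cgroup mount.
  private static final String CPU_HIERARCHY = "/cgroup/cpu/hadoop-yarn";

  // Returns the PIDs still attached to a container's cpu cgroup. The cgroup
  // directory cannot be removed while this list is non-empty.
  static List<String> livePids(String containerId) throws IOException {
    File tasksFile = new File(new File(CPU_HIERARCHY, containerId), "tasks");
    List<String> pids = new ArrayList<String>();
    BufferedReader reader = new BufferedReader(new FileReader(tasksFile));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.trim().length() > 0) {
          pids.add(line.trim());
        }
      }
    } finally {
      reader.close();
    }
    return pids;
  }

  public static void main(String[] args) throws IOException {
    for (String containerId : args) {
      List<String> pids = livePids(containerId);
      System.out.println(containerId + (pids.isEmpty()
          ? ": no live tasks, cgroup should be removable"
          : ": still has tasks " + pids));
    }
  }
}
{noformat}

Run on the NM host as, for example, java CgroupTaskCheck container_1371141151815_0008_03_000002 to see whether the "leaked" container is still attached to its cgroup.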
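For reproducing the setup described above (CGroups with the LinuxContainerExecutor plus stateful RM), a sketch of the relevant yarn-site.xml entries; the property names follow later 2.x documentation, and the RM recovery keys in particular are an assumption for a patched 2.0.5-alpha build:

{noformat}
<!-- LinuxContainerExecutor with the CGroups resources handler -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>

<!-- Stateful RM, assuming the file-system state store -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
</property>
{noformat}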