From: "Steve Loughran (JIRA)"
To: yarn-issues@hadoop.apache.org
Date: Fri, 16 Aug 2013 20:27:47 +0000 (UTC)
Subject: [jira] [Commented] (YARN-1073) NM to recognise when it can't span process and stop accepting containers

[ https://issues.apache.org/jira/browse/YARN-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742568#comment-13742568 ]

Steve Loughran commented on YARN-1073:
--------------------------------------

My AM asked for more containers than my laptop could provide, which caused the container exec operations to fail. That triggered launch-failure events back to the AM, which responded by requesting more containers to keep the desired count equal to the actual count. My code clearly needs to recognise this situation, but something at the RM/NM level ought to notice that every container the NM tries to launch is failing, and blacklist that node. Of course, on a single-NM cluster that wouldn't let my code recover, but on a real cluster it could. (A sketch of what such a check might look like follows the quoted issue below.)

{code}
376683901676, }, attemptId: 1, }, id: 139, }, state: C_RUNNING, diagnostics: "", exit_status: -1000,
2013-08-16 13:13:02,916 [ContainersLauncher #50] WARN containermanager.launcher.ContainerLaunch (ContainerLaunch.java:call(270)) - Failed to launch container.
java.io.IOException: Cannot run program "chmod": error=23, Too many open files in system
	at java.lang.ProcessBuilder.processException(ProcessBuilder.java:478)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:457)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:401)
	at org.apache.hadoop.util.Shell.run(Shell.java:373)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:667)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:650)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:637)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
	at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1007)
	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:351)
	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:575)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:672)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:668)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:668)
	at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2159)
	at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2101)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:144)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:265)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:75)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:680)
Caused by: java.io.IOException: error=23, Too many open files in system
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
	at java.lang.ProcessImpl.start(ProcessImpl.java:91)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
	... 26 more
2013-08-16 13:13:02,969 [ContainersLauncher #49] WARN containermanager.launcher.ContainerLaunch (ContainerLaunch.java:call(270)) - Failed to launch container.
java.io.IOException: Cannot run program "chmod": error=23, Too many open files in system
	at java.lang.ProcessBuilder.processException(ProcessBuilder.java:478)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:457)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:401)
	at org.apache.hadoop.util.Shell.run(Shell.java:373)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:667)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:650)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:637)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
	at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1007)
	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:351)
	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:575)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:672)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:668)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:668)
	at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2159)
	at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2101)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:144)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:265)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:75)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:680)
Caused by: java.io.IOException: error=23, Too many open files in system
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
	at java.lang.ProcessImpl.start(ProcessImpl.java:91)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
	... 26 more
2013-08-16 13:13:03,000 [AsyncDispatcher event handler] INFO containermanager.application.Application (ApplicationImpl.java:transition(277)) - Adding container_1376683901676_0001_01_000139 to application application_1376683901676_0001
2013-08-16 13:13:03,001 [AsyncDispatcher event handler] INFO containermanager.container.Container (ContainerImpl.java:handle(860)) - Container container_1376683901676_0001_01_000128 transitioned from RUNNING to EXITED_WITH_FAILURE
2013-08-16 13:13:03,001 [AsyncDispatcher event handler] INFO containermanager.container.Container (ContainerImpl.java:handle(860)) - Container container_1376683901676_0001_01_000130 transitioned from RUNNING to EXITED_WITH_FAILURE
2013-08-16 13:13:03,002 [AsyncDispatcher event handler] INFO containermanager.container.Container (ContainerImpl.java:handle(860)) - Container container_1376683901676_0001_01_000139 transitioned from NEW to LOCALIZING
2013-08-16 13:13:03,002 [AsyncDispatcher event handler] INFO containermanager.launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(323)) - Cleaning up container container_1376683901676_0001_01_000128
2013-08-16 13:13:03,141 [ContainersLauncher #52] WARN containermanager.launcher.ContainerLaunch (ContainerLaunch.java:call(270)) - Failed to launch container.
java.io.FileNotFoundException: /Users/stevel/Projects/Hortonworks/Projects/hoya/target/TestRestartClusterFromArchive/TestRestartClusterFromArchive-localDir-nm-0_0/nmPrivate/application_1376683901676_0001/container_1376683901676_0001_01_000133/.launch_container.sh.crc (Too many open files in system)
	at java.io.FileOutputStream.open(Native Method)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:223)
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:219)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:282)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:269)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:303)
	at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1007)
	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:351)
	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:575)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:672)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:668)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:668)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:221)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:75)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:680)
2013-08-16 13:13:03,144 [ContainersLauncher #44] WARN containermanager.launcher.ContainerLaunch (ContainerLaunch.java:call(270)) - Failed to launch container.
java.io.IOException: Cannot run program "chmod": error=23, Too many open files in system
	at java.lang.ProcessBuilder.processException(ProcessBuilder.java:478)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:457)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:401)
	at org.apache.hadoop.util.Shell.run(Shell.java:373)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:667)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:650)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:637)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
	at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1007)
	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:351)
	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:575)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:672)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:668)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:668)
	at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2159)
	at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2101)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:149)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:265)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:75)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:680)
Caused by: java.io.IOException: error=23, Too many open files in system
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
	at java.lang.ProcessImpl.start(ProcessImpl.java:91)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
	... 26 more
2013-08-16 13:13:03,154 [ContainersLauncher #63] WARN containermanager.launcher.ContainerLaunch (ContainerLaunch.java:call(270)) - Failed to launch container.
java.io.FileNotFoundException: /Users/stevel/Projects/Hortonworks/Projects/hoya/target/TestRestartClusterFromArchive/TestRestartClusterFromArchive-localDir-nm-0_0/nmPrivate/application_1376683901676_0001/container_1376683901676_0001_01_000138/launch_container.sh (Too many open files in system)
	at java.io.FileOutputStream.open(Native Method)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:223)
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:219)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:282)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:269)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:303)
	at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1007)
	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:344)
	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:575)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:672)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:668)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:668)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:221)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:75)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:680)
2013-08-16 13:13:03,161 [AsyncDispatcher event handler] INFO containermanager.launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(323)) - Cleaning up container container_1376683901676_0001_01_000130
2013-08-16 13:13:03,163 [ContainersLauncher #47] WARN containermanager.launcher.ContainerLaunch (ContainerLaunch.java:call(270)) - Failed to launch container.
java.io.IOException: Cannot run program "chmod": error=23, Too many open files in system
	at java.lang.ProcessBuilder.processException(ProcessBuilder.java:478)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:457)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:401)
	at org.apache.hadoop.util.Shell.run(Shell.java:373)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:667)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:650)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:637)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
	at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1007)
	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:351)
	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:575)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:672)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:668)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:668)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:221)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:75)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:680)
Caused by: java.io.IOException: error=23, Too many open files in system
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
	at java.lang.ProcessImpl.start(ProcessImpl.java:91)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
	... 23 more
{code}

> NM to recognise when it can't span process and stop accepting containers
> ------------------------------------------------------------------------
>
>                 Key: YARN-1073
>                 URL: https://issues.apache.org/jira/browse/YARN-1073
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.1.0-beta
>        Environment: OS/X with not enough file handles
>            Reporter: Steve Loughran
>            Priority: Minor
>
> When creating too many containers with a claimed resource use of 0 RAM or vCores, the NM got into a state where exec() was continually failing, but nothing recognised this and blacklisted the node.
> Something should notice that all container launches for an app/container are failing and act on it. While AMs can/should code for this, NM failure is something to handle at the YARN level.
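
To make the suggestion above concrete, here is a rough sketch of what an NM-side check could look like. This is illustrative only, not existing Hadoop code: the class name {{LaunchFailureTracker}}, the threshold, and the idea of wiring it into the launcher and heartbeat paths are all hypothetical.

{code}
// Hypothetical sketch, not part of the Hadoop codebase: track consecutive
// container-launch failures on a NodeManager and, past a threshold, have the
// NM stop accepting containers (e.g. by reporting itself unhealthy on its
// next heartbeat so the RM schedules around it).
import java.util.concurrent.atomic.AtomicInteger;

public class LaunchFailureTracker {

    // Assumed threshold; a real patch would presumably make this a
    // configurable yarn.nodemanager.* property.
    private final int maxConsecutiveFailures;
    private final AtomicInteger consecutiveFailures = new AtomicInteger(0);

    public LaunchFailureTracker(int maxConsecutiveFailures) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Called when a launch fails before the container process was spawned,
    // e.g. on the IOExceptions in the log above.
    public void onLaunchFailure() {
        consecutiveFailures.incrementAndGet();
    }

    // Called on a successful launch: one success resets the streak, so a
    // transient file-handle shortage doesn't blacklist the node forever.
    public void onLaunchSuccess() {
        consecutiveFailures.set(0);
    }

    // The NM would consult this before accepting or launching more work.
    public boolean shouldStopAcceptingContainers() {
        return consecutiveFailures.get() >= maxConsecutiveFailures;
    }
}
{code}

Counting consecutive failures rather than totals matters here: the goal is to catch the "every exec is failing" state without penalising a node for occasional one-off launch failures.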