hadoop-yarn-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1073) NM to recognise when it can't spawn processes and stop accepting containers
Date Fri, 16 Aug 2013 20:27:47 GMT

    [ https://issues.apache.org/jira/browse/YARN-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742568#comment-13742568 ]

Steve Loughran commented on YARN-1073:
--------------------------------------

My AM asked for more containers than my laptop could provide, which caused the container
exec operations to fail. That triggered launch-failure events back to the AM, which responded
by requesting more containers to keep the desired count equal to the actual count.
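
A guard on my side might look like the sketch below: track consecutive launch failures and stop re-requesting replacements once a threshold is crossed. This is illustrative only, not real Hoya code; the class, the MAX_CONSECUTIVE_FAILURES threshold and newRequest() are made up, and in practice the logic would hang off AMRMClientAsync.CallbackHandler#onContainersCompleted:

{code}
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

/**
 * Illustrative sketch: stop replacing failed containers once every
 * recent launch has failed, instead of spinning forever.
 */
public class FailureAwareReplacer {

  private static final int MAX_CONSECUTIVE_FAILURES = 8; // made-up threshold

  private final AMRMClientAsync<ContainerRequest> amrmClient;
  private final AtomicInteger consecutiveFailures = new AtomicInteger();

  public FailureAwareReplacer(AMRMClientAsync<ContainerRequest> amrmClient) {
    this.amrmClient = amrmClient;
  }

  /** Called from AMRMClientAsync.CallbackHandler#onContainersCompleted. */
  public void onContainersCompleted(List<ContainerStatus> statuses) {
    for (ContainerStatus status : statuses) {
      if (status.getExitStatus() == 0) {
        consecutiveFailures.set(0); // a clean exit resets the streak
        continue;
      }
      if (consecutiveFailures.incrementAndGet() >= MAX_CONSECUTIVE_FAILURES) {
        // every recent launch failed: the node (or, on a single-NM test
        // cluster, the whole cluster) is sick; stop asking for
        // replacements rather than looping
        continue;
      }
      // keep the desired count equal to the actual count
      amrmClient.addContainerRequest(newRequest());
    }
  }

  private ContainerRequest newRequest() {
    // resource, priority and locality details are application-specific
    throw new UnsupportedOperationException("application-specific");
  }
}
{code}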

My code clearly needs a guard of that kind, but something at the RM/NM level ought
to notice that every container an NM tries to launch is failing, and blacklist that
node. Of course, since this is a single-NM cluster, that wouldn't let my code recover, but on
a real cluster it could.
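
On the NM side, something like the following is what I have in mind; it is entirely hypothetical, and none of these names exist in the NodeManager. Count consecutive exec failures and stop accepting containers past a threshold, surfacing that through the node health report much as the disk checker does when local-dirs go bad:

{code}
/**
 * Hypothetical NM-side tracker; illustrative names only.
 */
public class LaunchFailureTracker {

  private static final int FAILURE_THRESHOLD = 5; // illustrative value

  private int consecutiveFailures;
  private boolean acceptingContainers = true;

  public synchronized void launchSucceeded() {
    consecutiveFailures = 0; // any successful launch clears the state
    acceptingContainers = true;
  }

  public synchronized void launchFailed() {
    if (++consecutiveFailures >= FAILURE_THRESHOLD) {
      // a real implementation would flag this in the node health report
      // sent on the next RM heartbeat, so the scheduler avoids the node
      acceptingContainers = false;
    }
  }

  public synchronized boolean isAcceptingContainers() {
    return acceptingContainers;
  }
}
{code}

For reference, here is the NM log from the failing run: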

{code}
376683901676, }, attemptId: 1, }, id: 139, }, state: C_RUNNING, diagnostics: "", exit_status:
-1000, 
2013-08-16 13:13:02,916 [ContainersLauncher #50] WARN  containermanager.launcher.ContainerLaunch
(ContainerLaunch.java:call(270)) - Failed to launch container.
java.io.IOException: Cannot run program "chmod": error=23, Too many open files in system
	at java.lang.ProcessBuilder.processException(ProcessBuilder.java:478)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:457)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:401)
	at org.apache.hadoop.util.Shell.run(Shell.java:373)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:667)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:650)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:637)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
	at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1007)
	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:351)
	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:575)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:672)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:668)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:668)
	at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2159)
	at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2101)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:144)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:265)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:75)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:680)
Caused by: java.io.IOException: error=23, Too many open files in system
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
	at java.lang.ProcessImpl.start(ProcessImpl.java:91)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
	... 26 more
2013-08-16 13:13:02,969 [ContainersLauncher #49] WARN  containermanager.launcher.ContainerLaunch
(ContainerLaunch.java:call(270)) - Failed to launch container.
java.io.IOException: Cannot run program "chmod": error=23, Too many open files in system
	at java.lang.ProcessBuilder.processException(ProcessBuilder.java:478)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:457)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:401)
	at org.apache.hadoop.util.Shell.run(Shell.java:373)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:667)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:650)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:637)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
	at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1007)
	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:351)
	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:575)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:672)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:668)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:668)
	at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2159)
	at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2101)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:144)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:265)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:75)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:680)
Caused by: java.io.IOException: error=23, Too many open files in system
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
	at java.lang.ProcessImpl.start(ProcessImpl.java:91)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
	... 26 more
2013-08-16 13:13:03,000 [AsyncDispatcher event handler] INFO  containermanager.application.Application
(ApplicationImpl.java:transition(277)) - Adding container_1376683901676_0001_01_000139 to
application application_1376683901676_0001
2013-08-16 13:13:03,001 [AsyncDispatcher event handler] INFO  containermanager.container.Container
(ContainerImpl.java:handle(860)) - Container container_1376683901676_0001_01_000128 transitioned
from RUNNING to EXITED_WITH_FAILURE
2013-08-16 13:13:03,001 [AsyncDispatcher event handler] INFO  containermanager.container.Container
(ContainerImpl.java:handle(860)) - Container container_1376683901676_0001_01_000130 transitioned
from RUNNING to EXITED_WITH_FAILURE
2013-08-16 13:13:03,002 [AsyncDispatcher event handler] INFO  containermanager.container.Container
(ContainerImpl.java:handle(860)) - Container container_1376683901676_0001_01_000139 transitioned
from NEW to LOCALIZING
2013-08-16 13:13:03,002 [AsyncDispatcher event handler] INFO  containermanager.launcher.ContainerLaunch
(ContainerLaunch.java:cleanupContainer(323)) - Cleaning up container container_1376683901676_0001_01_000128
2013-08-16 13:13:03,141 [ContainersLauncher #52] WARN  containermanager.launcher.ContainerLaunch
(ContainerLaunch.java:call(270)) - Failed to launch container.
java.io.FileNotFoundException: /Users/stevel/Projects/Hortonworks/Projects/hoya/target/TestRestartClusterFromArchive/TestRestartClusterFromArchive-localDir-nm-0_0/nmPrivate/application_1376683901676_0001/container_1376683901676_0001_01_000133/.launch_container.sh.crc
(Too many open files in system)
	at java.io.FileOutputStream.open(Native Method)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:223)
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:219)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:282)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:269)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:303)
	at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1007)
	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:351)
	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:575)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:672)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:668)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:668)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:221)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:75)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:680)
2013-08-16 13:13:03,144 [ContainersLauncher #44] WARN  containermanager.launcher.ContainerLaunch
(ContainerLaunch.java:call(270)) - Failed to launch container.
java.io.IOException: Cannot run program "chmod": error=23, Too many open files in system
	at java.lang.ProcessBuilder.processException(ProcessBuilder.java:478)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:457)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:401)
	at org.apache.hadoop.util.Shell.run(Shell.java:373)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:667)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:650)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:637)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
	at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1007)
	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:351)
	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:575)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:672)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:668)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:668)
	at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2159)
	at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2101)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:149)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:265)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:75)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:680)
Caused by: java.io.IOException: error=23, Too many open files in system
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
	at java.lang.ProcessImpl.start(ProcessImpl.java:91)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
	... 26 more
2013-08-16 13:13:03,154 [ContainersLauncher #63] WARN  containermanager.launcher.ContainerLaunch
(ContainerLaunch.java:call(270)) - Failed to launch container.
java.io.FileNotFoundException: /Users/stevel/Projects/Hortonworks/Projects/hoya/target/TestRestartClusterFromArchive/TestRestartClusterFromArchive-localDir-nm-0_0/nmPrivate/application_1376683901676_0001/container_1376683901676_0001_01_000138/launch_container.sh
(Too many open files in system)
	at java.io.FileOutputStream.open(Native Method)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:223)
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:219)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:282)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:269)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:303)
	at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1007)
	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:344)
	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:575)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:672)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:668)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:668)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:221)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:75)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:680)
2013-08-16 13:13:03,161 [AsyncDispatcher event handler] INFO  containermanager.launcher.ContainerLaunch
(ContainerLaunch.java:cleanupContainer(323)) - Cleaning up container container_1376683901676_0001_01_000130
2013-08-16 13:13:03,163 [ContainersLauncher #47] WARN  containermanager.launcher.ContainerLaunch
(ContainerLaunch.java:call(270)) - Failed to launch container.
java.io.IOException: Cannot run program "chmod": error=23, Too many open files in system
	at java.lang.ProcessBuilder.processException(ProcessBuilder.java:478)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:457)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:401)
	at org.apache.hadoop.util.Shell.run(Shell.java:373)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:667)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:650)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:637)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
	at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1007)
	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:351)
	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:575)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:672)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:668)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:668)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:221)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:75)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:680)
Caused by: java.io.IOException: error=23, Too many open files in system
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
	at java.lang.ProcessImpl.start(ProcessImpl.java:91)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
	... 23 more
{code}
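
For the record, error=23 on OS/X is ENFILE: the system-wide open file table is full, as distinct from EMFILE (errno 24), the per-process limit. That is why even the chmod forks fail. The JDK doesn't expose the errno from a failed fork/exec, so the only hook for classifying this as a node-level rather than container-level failure is the exception message; a crude and locale-fragile sketch (not a real YARN class):

{code}
import java.io.IOException;

/**
 * Illustrative classifier for "the whole node is out of file handles".
 */
public final class NodeLevelFailures {

  private NodeLevelFailures() {
  }

  /** True if the failure looks node-wide rather than container-specific. */
  public static boolean isFileTableExhausted(IOException e) {
    String message = String.valueOf(e.getMessage());
    // ENFILE (errno 23 on OS/X) reads "Too many open files in system";
    // EMFILE (errno 24) reads "Too many open files" without "in system"
    return message.contains("Too many open files in system");
  }
}
{code}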
                
> NM to recognise when it can't spawn processes and stop accepting containers
> ----------------------------------------------------------------------------
>
>                 Key: YARN-1073
>                 URL: https://issues.apache.org/jira/browse/YARN-1073
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.1.0-beta
>         Environment: OS/X with not enough file handles
>            Reporter: Steve Loughran
>            Priority: Minor
>
> When creating too many containers with a claimed resource use of 0 RAM or vCores, the
NM got into a state where exec() was failing continually, but nothing seemed to recognise
this and blacklist the node.
> Something should notice that all container launches for an app/container are failing,
and act on it. While AMs can and should handle this, detecting NM failure is something that
belongs at the YARN level.

