hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4298) NodeManager crashed after running out of file descriptors
Date Thu, 31 May 2012 19:45:23 GMT

[ https://issues.apache.org/jira/browse/MAPREDUCE-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286869#comment-13286869 ]

Jason Lowe commented on MAPREDUCE-4298:
---------------------------------------

The relevant portion of the NM logfile:

{noformat}
[New I/O server worker #1-2] 2012-05-25 21:30:36,719 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000833_0/file.out not found
[New I/O server worker #1-15] 2012-05-25 21:30:36,719 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000833_0/file.out not found
[New I/O server worker #1-23] 2012-05-25 21:30:36,719 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000833_0/file.out not found
[New I/O server worker #1-2] 2012-05-25 21:30:36,720 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/9/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000831_0/file.out not found
[New I/O server worker #1-17] 2012-05-25 21:30:36,720 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000833_0/file.out not found
[New I/O server worker #1-24] 2012-05-25 21:30:36,721 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/10/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000030_0/file.out not found
[New I/O server worker #1-30] 2012-05-25 21:30:36,722 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000314_0/file.out not found
[New I/O server worker #1-18] 2012-05-25 21:30:36,722 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1337019383242_29334_01_000148 succeeded
[ContainersLauncher #2677] 2012-05-25 21:30:36,722 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1337019383242_29334_01_000148 transitioned from RUNNING to EXITED_WITH_SUCCESS
[AsyncDispatcher event handler] 2012-05-25 21:30:36,722 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1337019383242_29334_01_000148
[AsyncDispatcher event handler] 2012-05-25 21:30:36,727 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/9/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000831_0/file.out not found
[New I/O server worker #1-7] 2012-05-25 21:30:36,727 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/0/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000263_0/file.out not found
[New I/O server worker #1-27] 2012-05-25 21:30:36,728 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/0/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000263_0/file.out not found
[New I/O server worker #1-13] 2012-05-25 21:30:36,728 FATAL org.apache.hadoop.conf.Configuration: error parsing conf file: java.io.FileNotFoundException: /grid/0/Releases/conf-xx-0.23.3.1204240816-20120514-000/hadoop/datanode/core-site.xml (Too many open files)
[AsyncDispatcher event handler] 2012-05-25 21:30:36,728 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
[AsyncDispatcher event handler] java.lang.RuntimeException: java.io.FileNotFoundException: /grid/0/Releases/conf-xx-0.23.3.1204240816-20120514-000/hadoop/datanode/core-site.xml (Too many open files)
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1723)
        at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1588)
        at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1534)
        at org.apache.hadoop.conf.Configuration.get(Configuration.java:581)
        at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:598)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1332)
        at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:142)
        at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:233)
        at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:315)
        at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:313)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
        at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:313)
        at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:426)
        at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:406)
        at org.apache.hadoop.fs.FileContext.getLocalFSFileContext(FileContext.java:392)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:353)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:144)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:54)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:125)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:74)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.FileNotFoundException: /grid/0/Releases/conf-xx-0.23.3.1204240816-20120514-000/hadoop/datanode/core-site.xml (Too many open files)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:106)
        at java.io.FileInputStream.<init>(FileInputStream.java:66)
        at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:70)
        at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:161)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:653)
        at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:186)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:772)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:235)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1637)
        ... 22 more
2012-05-25 21:30:36,729 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
{noformat}

It's not yet clear exactly why the NM hit its open file ulimit, which was set to 8192.  Jobs
with 4000+ reducers were running on the cluster, potentially several of them at once; just two
such jobs could put 8000+ reducers within striking distance of that limit.

I checked the ShuffleHandler, which is built on Netty, and didn't see any limit on the number
of connections Netty will accept.  I wrote a quick Python script that simply connects to the
shuffle port as fast as it can, holding all the sockets open (a sketch of such a script follows
the log below), and shortly afterwards the Netty boss thread crashed with this in the NM logfile:

{noformat}
Exception in thread "ShuffleHandler Netty Boss #0" java.lang.InternalError
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:755)
        at sun.misc.URLClassPath.getResource(URLClassPath.java:169)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:194)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at sun.misc.Launcher$ExtClassLoader.findClass(Launcher.java:229)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        at java.util.ResourceBundle$RBClassLoader.loadClass(ResourceBundle.java:435)
        at java.util.ResourceBundle$Control.newBundle(ResourceBundle.java:2289)
        at java.util.ResourceBundle.loadBundle(ResourceBundle.java:1364)
        at java.util.ResourceBundle.findBundle(ResourceBundle.java:1328)
        at java.util.ResourceBundle.findBundle(ResourceBundle.java:1282)
        at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1224)
        at java.util.ResourceBundle.getBundle(ResourceBundle.java:705)
        at java.util.logging.Level.getLocalizedName(Level.java:223)
        at java.util.logging.SimpleFormatter.format(SimpleFormatter.java:64)
        at java.util.logging.StreamHandler.publish(StreamHandler.java:179)
        at java.util.logging.ConsoleHandler.publish(ConsoleHandler.java:88)
        at java.util.logging.Logger.log(Logger.java:481)
        at java.util.logging.Logger.doLog(Logger.java:503)
        at java.util.logging.Logger.logp(Logger.java:703)
        at org.jboss.netty.logging.JdkLogger.warn(JdkLogger.java:86)
        at org.jboss.netty.logging.InternalLoggerFactory$1.warn(InternalLoggerFactory.java:133)
        at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:260)
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.zip.ZipException: error in opening zip file
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.<init>(ZipFile.java:127)
        at java.util.jar.JarFile.<init>(JarFile.java:135)
        at java.util.jar.JarFile.<init>(JarFile.java:72)
        at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:646)
        at sun.misc.URLClassPath$JarLoader.access$600(URLClassPath.java:540)
        at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:607)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:599)
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:753)
        ... 31 more
{noformat}
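
For reference, below is a minimal sketch of that kind of flooding script.  It is a
reconstruction, not the original script; the target host and the shuffle port are placeholder
assumptions.

{code}
#!/usr/bin/env python
# Sketch of the flooding script described above: connect to the shuffle
# port as fast as possible and hold every socket open so file descriptors
# accumulate on the NodeManager. Host and port are placeholders.
import socket

HOST = 'nm-host.example.com'   # hypothetical NodeManager host
PORT = 8080                    # assumed shuffle port on this cluster

socks = []
try:
    while True:
        s = socket.create_connection((HOST, PORT))
        socks.append(s)        # never closed; each one pins an fd on the NM
except OSError as e:
    print('stopped after %d connections: %s' % (len(socks), e))

input('holding connections open; press Enter to release them')
{code}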

So in this case Netty likely got an EMFILE error when it tried to accept a new connection after
hitting the descriptor limit.  When it then tried to log a warning, the logging call triggered
class loading, and the class loader could not obtain a file descriptor to reopen the jar file,
producing the InternalError above.
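
That failure mode is easy to reproduce outside of Netty.  Below is a small self-contained
demonstration (my own, not from the incident) of accept() failing with "Too many open files"
once the process descriptor limit is exhausted; the limit of 64 is arbitrary:

{code}
import os
import resource
import socket

# Lower the fd limit so the condition is trivial to hit.
resource.setrlimit(resource.RLIMIT_NOFILE, (64, 64))

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))
server.listen(16)

# Queue one pending connection so accept() has something to return.
client = socket.create_connection(server.getsockname())

# Burn every remaining descriptor.
spent = []
try:
    while True:
        spent.append(os.open('/dev/null', os.O_RDONLY))
except OSError as e:
    print('descriptor limit reached: %s' % e)

try:
    server.accept()   # fails: no descriptor available for the new socket
except OSError as e:
    print('accept failed: %s' % e)   # EMFILE, "Too many open files"
{code}

In the NM the analogous accept failure then cascaded: logging the warning needed class loading,
class loading needed a descriptor for the jar, and that open failed too.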

Given the number of reducers running, in theory they could all hit a single NodeManager at
nearly the same time while waiting for the last map to finish.  If for some reason the reducer
shuffle connections lingered long enough, enough reducers in a short enough time span would
exhaust the NodeManager's file descriptors.  I can't tell from the logs whether that is exactly
what happened here, but it is concerning that the shuffle handler appears to accept an unlimited
number of connections, so nothing stops it short of hard OS limits that can crash the NM in
various ways.
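
Not a proposed patch, just a rough sketch in plain Python of the kind of accept-side guard that
is missing (a real fix would live in the Netty pipeline, and the cap and port values here are
hypothetical): track live connections and shed new ones past a configurable cap rather than
running into the ulimit.

{code}
import socket
import threading

MAX_CONNECTIONS = 4096        # hypothetical cap, safely under an 8192 ulimit

active = 0
lock = threading.Lock()

def handle(conn):
    global active
    try:
        conn.recv(4096)       # stand-in for real shuffle request handling
    finally:
        conn.close()
        with lock:
            active -= 1

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('0.0.0.0', 8080))   # assumed shuffle port
server.listen(128)
while True:
    conn, _ = server.accept()
    with lock:
        shed = active >= MAX_CONNECTIONS
        if not shed:
            active += 1
    if shed:
        conn.close()          # refuse instead of exhausting descriptors
        continue
    threading.Thread(target=handle, args=(conn,)).start()
{code}

A cap like this turns descriptor exhaustion into refused shuffle connections, which the
reduce-side fetchers can treat as a retryable fetch failure instead of losing the whole NM.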
                
> NodeManager crashed after running out of file descriptors
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-4298
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4298
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>
> A node on one of our clusters fell over because it ran out of open file descriptors.  Log
> details with stack traceback to follow.
