[ https://issues.apache.org/jira/browse/HADOOP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485688 ]

Christian Kunz commented on HADOOP-1182:
----------------------------------------

Increasing the number of handlers in the namenode server seems to help a lot: the 'Call queue overflow discarding oldest call' warning messages no longer appear. CPU usage still climbs to a sustained 99.9% for a while at the beginning of the job. We will test more.

> DFS Scalability issue with filecache in large clusters
> ------------------------------------------------------
>
>                 Key: HADOOP-1182
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1182
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.12.1
>            Reporter: Christian Kunz
>
> When using filecache to distribute supporting files for map/reduce applications in a 1000-node cluster, many map tasks fail because of timeouts. There was no such problem using a 200-node cluster for the same applications with comparable input data. Either the whole job fails because of too many map failures, or, even worse, some map tasks hang indefinitely.
>
> java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:473)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
>         at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
>         at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>         at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
>         at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
>         at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
>         at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.