hadoop-common-user mailing list archives

From "Murali Krishna" <mura...@yahoo-inc.com>
Subject All datanodes getting marked as dead
Date Sun, 15 Jun 2008 13:47:10 GMT

            I was running an M/R job on a 90+ node cluster. While the
job was running, all of the data nodes appear to have gone dead. The
only major error I saw in the name node log is 'java.io.IOException:
Too many open files'. The job may try to open thousands of files.

            After some time, there were a lot of exceptions saying
'could only be replicated to 0 nodes instead of 1'. So it looks like
none of the data nodes are responding now; the job failed since it
couldn't write. I can see the following in the data node logs:

2008-06-15 02:38:28,477 WARN org.apache.hadoop.dfs.DataNode: java.net.SocketTimeoutException: timed out waiting for rpc response
        at org.apache.hadoop.ipc.Client.call(Client.java:484)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
        at org.apache.hadoop.dfs.$Proxy0.sendHeartbeat(Unknown Source)


All processes (datanodes + namenode) are still running, although the dfs
health status page shows all nodes as dead.


Some questions:

*         Is this kind of behavior expected when the name node runs out
of file handles?

*         Why are the data nodes not able to send the heartbeat (is it
related to the name node not having enough handles)?

*         What happens to the data in HDFS when all the data nodes fail
to send the heartbeat and the name node is in this state?

*         Is the solution just to increase the number of file handles
and restart the cluster?
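
For anyone checking the same condition, here is a minimal sketch, assuming a Linux host with /proc mounted, of how the open-file limit and the current descriptor count could be inspected. The helper names are hypothetical and not part of Hadoop. The limit reported is that of the process running the script, so it only reflects the name node if the name node was started under the same limits; counting another process's descriptors means passing its PID and having permission to read /proc/<pid>/fd.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_open_fds(pid="self"):
    """Count open descriptors by listing /proc/<pid>/fd (Linux only).

    Returns None when the directory is missing or unreadable.
    """
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None


def usage_fraction(open_count, soft_limit):
    """Fraction of the soft limit currently in use."""
    return open_count / soft_limit


if __name__ == "__main__":
    soft, hard = nofile_limits()
    used = count_open_fds()
    print(f"soft limit={soft} hard limit={hard} open={used}")
    if used is not None and soft != resource.RLIM_INFINITY:
        print(f"in use: {usage_fraction(used, soft):.1%}")
```

A count close to the soft limit would be consistent with the 'Too many open files' error, though it does not by itself explain the missed heartbeats.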



