hadoop-common-user mailing list archives

From Philip Zeyliger <phi...@cloudera.com>
Subject Re: DataNode gets 'stuck', ends up with two DataNode processes
Date Mon, 09 Mar 2009 17:40:13 GMT
Very naively looking at the stack traces, a common theme is that there's a
call out to "df" to find the system capacity.  If you see two data node
processes, perhaps the fork/exec to call out to "df" is failing in some
strange way.

"DataNode: [/hadoop-data/dfs/data]" daemon prio=10 tid=0x0000002ae2c0d400 nid=0x21cf in Object.wait()
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:485)
	at java.lang.UNIXProcess$Gate.waitForExit(UNIXProcess.java:64)
	- locked <0x0000002a9fd84f98> (a java.lang.UNIXProcess$Gate)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:145)
	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
	at org.apache.hadoop.util.Shell.run(Shell.java:134)
	at org.apache.hadoop.fs.DF.getCapacity(DF.java:63)
	at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.getCapacity(FSDataset.java:341)
	at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.getCapacity(FSDataset.java:501)
	- locked <0x0000002a9ed97078> (a
	at org.apache.hadoop.hdfs.server.datanode.FSDataset.getCapacity(FSDataset.java:697)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:671)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1105)
	at java.lang.Thread.run(Thread.java:619)
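
To make the "df" theory testable, here is a minimal sketch (assuming Java 8 or later) that runs an external command on a worker thread under a deadline, so the caller is released even if the fork/exec never completes. DfProbe and its run method are hypothetical names, not part of Hadoop, and the "df -k <path>" invocation is my assumption about what DF.getCapacity shells out to. Note that ProcessBuilder.start() is deliberately inside the worker, since that is where the thread in the trace above is parked.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of Hadoop: runs a command under a deadline. */
final class DfProbe {

    /**
     * Runs the command on a worker thread and returns its combined stdout/stderr.
     * Throws TimeoutException if fork/exec plus reading the output takes longer
     * than timeoutMillis. The caller is released on timeout; the worker thread
     * itself may stay blocked, which is why it is a daemon thread.
     */
    static String run(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException, TimeoutException {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "df-probe");
            t.setDaemon(true); // a wedged worker must not keep the JVM alive
            return t;
        });
        Future<String> result = worker.submit(() -> {
            // start() performs the fork/exec, so it runs under the deadline too
            Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
            try (InputStream in = p.getInputStream()) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                p.waitFor();
                return out.toString("UTF-8");
            } finally {
                p.destroy(); // best effort; harmless if the child already exited
            }
        });
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (ExecutionException e) {
            throw new IOException("command failed: " + command, e.getCause());
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : ".";
        try {
            // Assumed invocation; the exact command Hadoop uses may differ.
            System.out.print(run(Arrays.asList("df", "-k", path), 10_000));
        } catch (TimeoutException e) {
            System.err.println("df did not finish within 10s; fork/exec may be stuck");
        }
    }
}
```

Running this in a loop against /hadoop-data/dfs/data on an affected node would show whether "df" itself ever hangs there. It will not reproduce a hang that only occurs when forking from the large DataNode JVM, so a clean result does not rule the theory out.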

On Mon, Mar 9, 2009 at 8:17 AM, Garhan Attebury <attebury@cse.unl.edu> wrote:

> On a ~100 node cluster running HDFS (we just use HDFS + fuse, no job/task
> trackers) I've noticed many datanodes get 'stuck'. The nodes themselves seem
> fine with no network/memory problems, but in every instance I see two
> DataNode processes running, and the NameNode logs indicate the datanode in
> question simply stopped responding. This state persists until I come along
> and kill the DataNode processes and restart the DataNode on that particular
> machine.
> I'm at a loss as to why this is happening, so here's all the relevant
> information I can think of sharing:
> hadoop version = 0.19.1-dev, r (we possibly have some custom patches
> running, but nothing which would affect HDFS that I'm aware of)
> number of nodes = ~100
> HDFS size = ~230TB
> Java version =
> OS = CentOS 4.7 x86_64, 4/8 core Opterons with 4GB/16GB of memory
> respectively
> I managed to grab a stack dump via "kill -3" from two of these problem
> instances and threw up the logs at
> http://cse.unl.edu/~attebury/datanode_problem/.
> The .log files honestly show nothing out of the ordinary, and since I have very
> little Java development experience the .out files mean nothing to me. It's
> also worth mentioning that the NameNode logs at the time when these
> DataNodes got stuck show nothing out of the ordinary either -- just the
> expected "lost heartbeat from node <x>" message. The DataNode daemon (the
> original process, not the second mysterious one) continues to respond to web
> requests like browsing the log directory during this time.
> Whenever this happens I've just manually done a "kill -9" to remove the two
> stuck DataNode processes (I'm not even sure why there are two of them, as
> under normal operation there's only one). After killing the stuck ones, I
> simply do a "hadoop-daemon.sh start datanode" and all is normal again. I've
> not seen any data loss or corruption as a result of this problem.
> Has anyone seen anything like this happen before? Out of our ~100 node
> cluster I see this problem around once a day, and it seems to just strike
> random nodes at random times. It happens often enough that I would be happy
> to do additional debugging if anyone can tell me how. I'm not a developer at
> all, so I'm at the end of my knowledge on how to solve this problem. Thanks
> for any help!
> ===============================
> Garhan Attebury
> Systems Administrator
> UNL Research Computing Facility
> 402-472-7761
> ===============================
