hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: DataNode gets 'stuck', ends up with two DataNode processes
Date Mon, 09 Mar 2009 18:23:58 GMT
It's very strange.  It appears that the second process is the result of a
fork call, yet has only one thread running whose gdb backtrace looks like
this:

(gdb) bt
#0  0x0000003e10c0af8b in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0
#1  0x0000000000000000 in ?? ()

Not very helpful!  I'd normally suspect some strange memory issue, but I've
checked - there was plenty of memory available on the host when the second
process was spawned and we weren't close to the file descriptor limit.


Looking at this issue,
https://issues.apache.org/jira/browse/HADOOP-2231

it seems that the "df" call is avoidable now that we're on Java 1.6.
However, the issue was closed and marked as a duplicate without noting what
it was a duplicate of (grrr).  Is there an updated version of that patch?
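
Just to illustrate what I mean by "avoidable": the java.io.File methods
added in 1.6 report capacity without forking a child process at all.  A
minimal sketch (my own illustration, not the HADOOP-2231 patch; the path
is just an example) would be:

// Sketch only: exec-free capacity check via java.io.File (Java 1.6+).
import java.io.File;

public class DiskCapacity {
    public static void main(String[] args) {
        // Example volume directory; pass a real data directory as an argument.
        File volume = new File(args.length > 0 ? args[0] : "/hadoop-data/dfs/data");
        long capacity = volume.getTotalSpace();   // total bytes, like the "df" size column
        long available = volume.getUsableSpace(); // bytes usable by this JVM
        System.out.println("capacity=" + capacity + " available=" + available);
    }
}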

Brian

On Mar 9, 2009, at 12:48 PM, Steve Loughran wrote:

> Philip Zeyliger wrote:
>> Very naively looking at the stack traces, a common theme is that there's
>> a call out to "df" to find the system capacity.  If you see two data node
>> processes, perhaps the fork/exec to call out to "df" is failing in some
>> strange way.
>
> that's deep into Java code. OpenJDK gives you more of that source.
> One option here is to consider some kind of timeout in the exec, but it's
> pretty tricky to tack that on around the Java runtime APIs, because the
> process APIs weren't really designed to be interrupted by other threads.
>
> -steve
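
(For the record, the sort of watchdog Steve describes might look roughly
like the sketch below on Java 6, where Process has no timed waitFor().
This is only an illustration, not Hadoop code, and a real version would
also need to drain the child's stdout/stderr.)

// Hypothetical watchdog around a child process: destroy it if it doesn't
// exit within the timeout.  Sketch only, not from Hadoop.
import java.util.Timer;
import java.util.TimerTask;

public class TimedExec {
    static int runWithTimeout(final Process p, long timeoutMs)
            throws InterruptedException {
        Timer watchdog = new Timer(true);          // daemon timer thread
        watchdog.schedule(new TimerTask() {
            public void run() { p.destroy(); }     // kill the child; waitFor() then returns
        }, timeoutMs);
        try {
            return p.waitFor();                    // blocks until exit or destroy()
        } finally {
            watchdog.cancel();
        }
    }

    public static void main(String[] args) throws Exception {
        Process p = new ProcessBuilder("df", "-k", "/").start();
        System.out.println("exit=" + runWithTimeout(p, 10000L));
    }
}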
>
>> "DataNode: [/hadoop-data/dfs/data]" daemon prio=10
>> tid=0x0000002ae2c0d400 nid=0x21cf in Object.wait()
>> [0x0000000042c54000..0x0000000042c54b30]
>>   java.lang.Thread.State: WAITING (on object monitor)
>> 	at java.lang.Object.wait(Native Method)
>> 	at java.lang.Object.wait(Object.java:485)
>> 	at java.lang.UNIXProcess$Gate.waitForExit(UNIXProcess.java:64)
>> 	- locked <0x0000002a9fd84f98> (a java.lang.UNIXProcess$Gate)
>> 	at java.lang.UNIXProcess.<init>(UNIXProcess.java:145)
>> 	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>> 	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
>> 	at org.apache.hadoop.util.Shell.run(Shell.java:134)
>> 	at org.apache.hadoop.fs.DF.getCapacity(DF.java:63)
>> 	at org.apache.hadoop.hdfs.server.datanode.FSDataset 
>> $FSVolume.getCapacity(FSDataset.java:341)
>> 	at org.apache.hadoop.hdfs.server.datanode.FSDataset 
>> $FSVolumeSet.getCapacity(FSDataset.java:501)
>> 	- locked <0x0000002a9ed97078> (a
>> org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
>> 	at  
>> org 
>> .apache 
>> .hadoop.hdfs.server.datanode.FSDataset.getCapacity(FSDataset.java: 
>> 697)
>> 	at  
>> org 
>> .apache 
>> .hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:671)
>> 	at  
>> org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java: 
>> 1105)
>> 	at java.lang.Thread.run(Thread.java:619)
>> On Mon, Mar 9, 2009 at 8:17 AM, Garhan Attebury  
>> <attebury@cse.unl.edu>wrote:
>>> On a ~100 node cluster running HDFS (we just use HDFS + fuse, no
>>> job/task trackers) I've noticed many datanodes get 'stuck'. The nodes
>>> themselves seem fine with no network/memory problems, but in every
>>> instance I see two DataNode processes running, and the NameNode logs
>>> indicate the datanode in question simply stopped responding. This state
>>> persists until I come along and kill the DataNode processes and restart
>>> the DataNode on that particular machine.
>>>
>>> I'm at a loss as to why this is happening, so here's all the relevant
>>> information I can think of sharing:
>>>
>>> hadoop version = 0.19.1-dev, r (we possibly have some custom patches running, but nothing which would affect HDFS that I'm aware of)
>>> number of nodes = ~100
>>> HDFS size = ~230TB
>>> Java version =
>>> OS = CentOS 4.7 x86_64, 4/8 core Opterons with 4GB/16GB of memory respectively
>>>
>>> I managed to grab a stack dump via "kill -3" from two of these problem
>>> instances and threw up the logs at
>>> http://cse.unl.edu/~attebury/datanode_problem/
>>> The .log files honestly show nothing out of the ordinary, and having very
>>> little Java development experience the .out files mean nothing to me.
>>> It's also worth mentioning that the NameNode logs at the time when these
>>> DataNodes got stuck show nothing out of the ordinary either -- just the
>>> expected "lost heartbeat from node <x>" message. The DataNode daemon (the
>>> original process, not the second mysterious one) continues to respond to
>>> web requests like browsing the log directory during this time.
>>>
>>> Whenever this happens I've just manually done a "kill -9" to remove the
>>> two stuck DataNode processes (I'm not even sure why there's two of them,
>>> as under normal operation there's only one). After killing the stuck
>>> ones, I simply do a "hadoop-daemon.sh start datanode" and all is normal
>>> again. I've not seen any data loss or corruption as a result of this
>>> problem.
>>>
>>> Has anyone seen anything like this happen before? Out of our ~100 node
>>> cluster I see this problem around once a day, and it seems to just strike
>>> random nodes at random times. It happens often enough that I would be
>>> happy to do additional debugging if anyone can tell me how. I'm not a
>>> developer at all, so I'm at the end of my knowledge on how to solve this
>>> problem. Thanks for any help!
>>>
>>>
>>> ===============================
>>> Garhan Attebury
>>> Systems Administrator
>>> UNL Research Computing Facility
>>> 402-472-7761
>>> ===============================
>>>
>>>
>
>
> -- 
> Steve Loughran                  http://www.1060.org/blogxter/publish/5
> Author: Ant in Action           http://antbook.org/

