hadoop-common-user mailing list archives

From: Brian Bockelman <bbock...@cse.unl.edu>
Subject: Re: DataNode gets 'stuck', ends up with two DataNode processes
Date: Mon, 09 Mar 2009 19:45:03 GMT
Hey Jason,

Looks like you're on the right track.  We are getting stuck after the JVM forks but before it exec's.  It appears to be this bug:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6671051

However, we're on Java 1.6.0_11, and that bug was marked as fixed in 1.6.0_6 :(
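
For what it's worth, the stuck call path isn't anything Hadoop-specific underneath -- it's just ProcessBuilder.start().  Here's a quick, untested sketch that hammers fork/exec the same way the DataNode's df check does, in case anyone wants to try reproducing the hang outside of Hadoop (the command and loop count are arbitrary):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ForkHammer {
  public static void main(String[] args) throws Exception {
    for (int i = 0; ; i++) {
      // Same pattern as the DataNode's shell-out: fork/exec df,
      // drain its stdout, wait for it to exit.
      Process p = new ProcessBuilder("df", "-k", "/").start();
      BufferedReader r =
          new BufferedReader(new InputStreamReader(p.getInputStream()));
      while (r.readLine() != null) { /* drain */ }
      r.close();
      p.waitFor();
      if (i % 1000 == 0) System.out.println("forked " + i + " times");
    }
  }
}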

Any other ideas?

Brian

On Mar 9, 2009, at 2:21 PM, jason hadoop wrote:

> There were a couple of fork timing errors in JDK 1.5 that occasionally caused a subprocess fork to go bad; this could be the du/df being forked off by the datanode and dying.
>
> I can't find the references I had saved away at one point from the Java forums, but perhaps this will get you started:
>
> http://forums.sun.com/thread.jspa?threadID=5297465&tstart=0
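>
> If it's the child dying rather than hanging, Hadoop's own shell wrapper should surface that as an ExitCodeException, which would at least separate "child exited badly" from "child never exec'd".  Rough sketch from memory against the 0.19 util classes -- double-check the names:
>
> import org.apache.hadoop.util.Shell;
> import org.apache.hadoop.util.Shell.ShellCommandExecutor;
>
> public class DfProbe {
>   public static void main(String[] args) throws Exception {
>     // Same fork/exec path the DataNode uses for its df/du checks.
>     ShellCommandExecutor df =
>         new ShellCommandExecutor(new String[] {"df", "-k", "/"});
>     try {
>       df.execute();                        // fork, exec, wait, collect
>       System.out.println(df.getOutput());  // normal case
>     } catch (Shell.ExitCodeException e) {
>       // Child ran but exited non-zero: the "fork went bad" case.
>       System.err.println("df exited " + df.getExitCode());
>     }
>   }
> }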
>
> On Mon, Mar 9, 2009 at 11:23 AM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
>
>> It's very strange.  It appears that the second process is the result of a fork call, yet has only one thread running, whose gdb backtrace looks like this:
>>
>> (gdb) bt
>> #0  0x0000003e10c0af8b in __lll_mutex_lock_wait () from
>> /lib64/tls/libpthread.so.0
>> #1  0x0000000000000000 in ?? ()
>>
>> Not very helpful!  I'd normally suspect some strange memory issue, but I've checked: there was plenty of memory available on the host when the second process was spawned, and we weren't close to the file descriptor limit.
>>
>>
>> Looking at this issue,
>> https://issues.apache.org/jira/browse/HADOOP-2231
>>
>> it seems that the "df" call is avoidable now that we're on Java 1.6.  However, the issue was closed as a duplicate without noting what it was a duplicate of (grrr).  Is there an updated version of that patch?
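>>
>> In the meantime, a df-free capacity check looks almost trivial with the space methods java.io.File grew in 1.6.  Rough sketch only -- the method names mirror DF's, but this is not the actual HADOOP-2231 patch:
>>
>> import java.io.File;
>>
>> public class JavaDF {
>>   private final File path;
>>
>>   public JavaDF(File path) { this.path = path; }
>>
>>   /** Total size of the partition, in bytes -- no fork/exec involved. */
>>   public long getCapacity()  { return path.getTotalSpace(); }
>>
>>   /** Bytes usable by this JVM (respects quotas/reserved space). */
>>   public long getAvailable() { return path.getUsableSpace(); }
>>
>>   public static void main(String[] args) {
>>     JavaDF df = new JavaDF(new File(args.length > 0 ? args[0] : "/"));
>>     System.out.println("capacity:  " + df.getCapacity());
>>     System.out.println("available: " + df.getAvailable());
>>   }
>> }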
>>
>> Brian
>>
>>
>> On Mar 9, 2009, at 12:48 PM, Steve Loughran wrote:
>>
>>> Philip Zeyliger wrote:
>>>> Very naively looking at the stack traces, a common theme is that there's a call out to "df" to find the system capacity.  If you see two data node processes, perhaps the fork/exec to call out to "df" is failing in some strange way.
>>>
>>> that's deep into Java code.  OpenJDK gives you more of that source.  One option here is to consider some kind of timeouts in the exec, but it's pretty tricky to tack that on round the Java runtime APIs, because the process APIs weren't really designed to be interrupted by other threads.
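>>>
>>> A hand-rolled watchdog is about as good as it gets -- untested sketch below -- and note that Process.destroy() can't do anything useful if the child is wedged between fork and exec, which is exactly the failure mode being described:
>>>
>>> import java.util.Timer;
>>> import java.util.TimerTask;
>>>
>>> public class TimedExec {
>>>   public static int run(String[] cmd, long timeoutMs) throws Exception {
>>>     final Process p = new ProcessBuilder(cmd).start();
>>>     Timer watchdog = new Timer(true);        // daemon timer thread
>>>     watchdog.schedule(new TimerTask() {
>>>       public void run() { p.destroy(); }     // fires on timeout
>>>     }, timeoutMs);
>>>     try {
>>>       return p.waitFor();                    // destroy() unblocks this
>>>     } finally {
>>>       watchdog.cancel();
>>>     }
>>>   }
>>> }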
>>>
>>> -steve
>>>
>>> "DataNode: [/hadoop-data/dfs/data]" daemon prio=10
>>>> tid=0x0000002ae2c0d400 nid=0x21cf in Object.wait()
>>>> [0x0000000042c54000..0x0000000042c54b30]
>>>> java.lang.Thread.State: WAITING (on object monitor)
>>>>       at java.lang.Object.wait(Native Method)
>>>>       at java.lang.Object.wait(Object.java:485)
>>>>       at java.lang.UNIXProcess$Gate.waitForExit(UNIXProcess.java: 
>>>> 64)
>>>>       - locked <0x0000002a9fd84f98> (a java.lang.UNIXProcess$Gate)
>>>>       at java.lang.UNIXProcess.<init>(UNIXProcess.java:145)
>>>>       at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>>>>       at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>>>>       at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
>>>>       at org.apache.hadoop.util.Shell.run(Shell.java:134)
>>>>       at org.apache.hadoop.fs.DF.getCapacity(DF.java:63)
>>>>       at
>>>> org.apache.hadoop.hdfs.server.datanode.FSDataset 
>>>> $FSVolume.getCapacity(FSDataset.java:341)
>>>>       at
>>>> org.apache.hadoop.hdfs.server.datanode.FSDataset 
>>>> $FSVolumeSet.getCapacity(FSDataset.java:501)
>>>>       - locked <0x0000002a9ed97078> (a
>>>> org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
>>>>       at
>>>> org 
>>>> .apache 
>>>> .hadoop.hdfs.server.datanode.FSDataset.getCapacity(FSDataset.java: 
>>>> 697)
>>>>       at
>>>> org 
>>>> .apache 
>>>> .hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java: 
>>>> 671)
>>>>       at
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java: 
>>>> 1105)
>>>>       at java.lang.Thread.run(Thread.java:619)
>>>>
>>>> On Mon, Mar 9, 2009 at 8:17 AM, Garhan Attebury <attebury@cse.unl.edu> wrote:
>>>>
>>>>> On a ~100 node cluster running HDFS (we just use HDFS + fuse, no job/task trackers) I've noticed many datanodes get 'stuck'.  The nodes themselves seem fine with no network/memory problems, but in every instance I see two DataNode processes running, and the NameNode logs indicate the datanode in question simply stopped responding.  This state persists until I come along and kill the DataNode processes and restart the DataNode on that particular machine.
>>>>>
>>>>> I'm at a loss as to why this is happening, so here's all the relevant information I can think of sharing:
>>>>>
>>>>> hadoop version = 0.19.1-dev, r (we possibly have some custom patches running, but nothing which would affect HDFS that I'm aware of)
>>>>> number of nodes = ~100
>>>>> HDFS size = ~230TB
>>>>> Java version =
>>>>> OS = CentOS 4.7 x86_64, 4/8 core Opterons with 4GB/16GB of memory respectively
>>>>>
>>>>> I managed to grab a stack dump via "kill -3" from two of these problem instances and threw up the logs at http://cse.unl.edu/~attebury/datanode_problem/.
>>>>> The .log files honestly show nothing out of the ordinary, and having very little Java development experience, the .out files mean nothing to me.  It's also worth mentioning that the NameNode logs at the time when these DataNodes got stuck show nothing out of the ordinary either -- just the expected "lost heartbeat from node <x>" message.  The DataNode daemon (the original process, not the second mysterious one) continues to respond to web requests like browsing the log directory during this time.
>>>>>
>>>>> Whenever this happens I've just manually done a "kill -9" to remove the two stuck DataNode processes (I'm not even sure why there are two of them, as under normal operation there's only one).  After killing the stuck ones, I simply do a "hadoop-daemon.sh start datanode" and all is normal again.  I've not seen any data loss or corruption as a result of this problem.
>>>>>
>>>>> Has anyone seen anything like this happen before?  Out of our ~100 node cluster I see this problem around once a day, and it seems to just strike random nodes at random times.  It happens often enough that I would be happy to do additional debugging if anyone can tell me how.  I'm not a developer at all, so I'm at the end of my knowledge on how to solve this problem.  Thanks for any help!
>>>>>
>>>>>
>>>>> ===============================
>>>>> Garhan Attebury
>>>>> Systems Administrator
>>>>> UNL Research Computing Facility
>>>>> 402-472-7761
>>>>> ===============================
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> Steve Loughran                  http://www.1060.org/blogxter/publish/5
>>> Author: Ant in Action           http://antbook.org/
>>>
>>
>>
>
>
> -- 
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422

