hadoop-common-user mailing list archives

From Konstantin Shvachko <...@yahoo-inc.com>
Subject Re: NameNode fatal crash - 0.18.1
Date Fri, 09 Jan 2009 19:37:18 GMT
Hi, Jonathan.
The problem is that the local drive(s) you use for "dfs.name.dir" became
inaccessible, so the name-node could no longer persist name-space modifications
and therefore terminated itself.
Everything else is a consequence of that.
This is the core message:
 > 2008-12-15 01:49:31,178 FATAL org.apache.hadoop.fs.FSNamesystem: Fatal Error
 > : All storage directories are inaccessible.
Could you please check the drives?
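
Also, if you have the space, you can list more than one directory in
"dfs.name.dir" (separate drives, or an NFS mount); the name-node keeps running
as long as at least one of them is still writable, and only aborts once all of
them are inaccessible, as in your log. A rough sketch for hadoop-site.xml --
the paths below are just placeholders, not a recommendation for your layout:

  <property>
    <name>dfs.name.dir</name>
    <value>/disk1/dfs/name,/disk2/dfs/name,/mnt/nfs/dfs/name</value>
  </property>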
--Konstantin


Jonathan Gray wrote:
> I have a 10+1 node cluster, each slave running DataNode/TaskTracker/HBase
> RegionServer.
> 
> At the time of this crash, the NameNode and SecondaryNameNode were both hosted
> on the same master.
> 
> We run a nightly backup, and about 95% of the way through it, HDFS crashed
> with...
> 
> NameNode shows:
> 
> 2008-12-15 01:49:31,178 ERROR org.apache.hadoop.fs.FSNamesystem: Unable to
> sync edit log. Fatal Error.
> 2008-12-15 01:49:31,178 FATAL org.apache.hadoop.fs.FSNamesystem: Fatal Error
> : All storage directories are inaccessible.
> 2008-12-15 01:49:31,179 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:
> 
> Every single DataNode shows:
> 
> 2008-12-15 01:49:32,340 WARN org.apache.hadoop.dfs.DataNode:
> java.io.IOException: Call failed on local exception
>         at org.apache.hadoop.ipc.Client.call(Client.java:718)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>         at org.apache.hadoop.dfs.$Proxy4.sendHeartbeat(Unknown Source)
>         at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:655)
>         at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2888)
>         at java.lang.Thread.run(Thread.java:636)
> Caused by: java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at
> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
> 
> 
> This is virtually all of the information I have.  At the same time as the
> backup, we have our normal HBase traffic and our hourly batch MR jobs, so the
> slave nodes were pretty heavily loaded, but I don't see anything in the DN logs
> besides this "Call failed".  There are no space issues or anything else;
> Ganglia shows high CPU load around this time, which has been typical every
> night, but I don't see anything in the DNs or the NN about expired leases,
> missing heartbeats, etc.
> 
> Is there a way to prevent this failure from happening in the first place?  I
> guess just reduce the total load across the cluster?
> 
> Second question is about how to recover once NameNode does fail...
> 
> When trying to bring HDFS back up, we get hundreds of:
> 
> 2008-12-15 07:54:13,265 ERROR org.apache.hadoop.dfs.LeaseManager: XXX not
> found in lease.paths
> 
> And then
> 
> 2008-12-15 07:54:13,267 ERROR org.apache.hadoop.fs.FSNamesystem:
> FSNamesystem initialization failed.
> 
> 
> Is there a way to recover from this?  As of the time of this crash, we had the
> SecondaryNameNode on the same node.  I'm moving it to another node with
> sufficient memory now, but would that even prevent this kind of FS botching?
> 
> Also, my SecondaryNameNode is telling me it cannot connect when trying to do
> a checkpoint:
> 
> 2008-12-15 09:59:48,017 ERROR
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in
> doCheckpoint:
> 2008-12-15 09:59:48,018 ERROR
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode:
> java.net.ConnectException: Connection refused
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
> 
> I changed my masters file to contain just the hostname of the
> SecondaryNameNode.  This seems to have properly started the NameNode on the
> node where I launched ./bin/start-dfs.sh, and started the SecondaryNameNode on
> the correct node as well, but the secondary seems unable to connect back to the
> primary.  I have hadoop-site.xml pointing fs.default.name at the primary, but
> otherwise there are no links back.  Where would I specify to the secondary
> where the primary is located?
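> 
> For reference, here is roughly what I have in hadoop-site.xml on both nodes
> (the hostname and port are placeholders):
> 
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://namenode-host:9000</value>
>   </property>
> 
> I'm not sure whether the secondary also needs dfs.http.address pointed at the
> primary's web port (50070 by default) so it can pull the image for a
> checkpoint, or whether fs.default.name alone should be enough.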
> 
> We're also upgrading to Hadoop 0.19.0 at this time.
> 
> Thank you for any help.
> 
> Jonathan Gray
> 
> 
