hadoop-hdfs-user mailing list archives

From Ayon Sinha <ayonsi...@yahoo.com>
Subject Re: NameNode crash - cannot start dfs - need help
Date Tue, 05 Oct 2010 16:20:21 GMT
We had almost exactly the same problem: the namenode disk filling up and the
namenode failing at this exact same point. Since you have now freed up space,
you can copy over the edits.new, fsimage and the other 2 files from your
/mnt/namesecondarynode/current and try restarting the namenode.
I believe you will lose some edits and probably some blocks of some files, but
we could recover most of our files.
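
In case it helps, here is roughly that procedure as a Python sketch (plain cp
does the same job). The paths are the ones from this thread; the file list is
my assumption -- I believe the "other 2 files" are fstime and VERSION, but
check what your checkpoint directory actually contains. Stop the namenode
before touching anything.

    # Rough sketch, not a supported tool: restore the namenode image
    # directory from the secondary namenode's last checkpoint.
    import shutil
    import time

    CHECKPOINT_DIR = "/mnt/namesecondarynode/current"  # fs.checkpoint.dir/current
    NAME_DIR = "/mnt/name/current"                     # dfs.name.dir/current

    # Keep a copy of the damaged state before overwriting anything.
    shutil.copytree(NAME_DIR, NAME_DIR + ".broken." + str(int(time.time())))

    # fstime and VERSION are my guess at the "other 2 files".
    for name in ("fsimage", "edits", "fstime", "VERSION"):
        shutil.copy2(CHECKPOINT_DIR + "/" + name, NAME_DIR + "/" + name)
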
 -Ayon

________________________________
From: Matthew LeMieux <mdl@mlogiciels.com>
To: hdfs-user@hadoop.apache.org
Sent: Tue, October 5, 2010 8:16:15 AM
Subject: NameNode crash - cannot start dfs - need help

The namenode on an otherwise very stable HDFS cluster crashed recently. The
filesystem filled up on the name node, which I assume is what caused the
crash. The problem has been fixed, but I cannot get the namenode to restart.
I am using version CDH3b2 (hadoop-0.20.2+320).

The error is this: 

2010-10-05 14:46:55,989 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 969 loaded in 0 seconds.
2010-10-05 14:46:55,992 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "12862^@^@^@^@^@^@^@^@"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Long.parseLong(Long.java:419)
        at java.lang.Long.parseLong(Long.java:468)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
        ...

This page (http://wiki.apache.org/hadoop/TroubleShooting) recommends editing
the edits file with a hex editor, but does not explain where the record
boundaries are. The exception there is different, but the cause seemed
similar: a corrupt edits file. I tried removing a line at a time, but the
error continues, only with a smaller size and edits # (there is a sketch of
my approach after the log below):

2010-10-05 14:37:16,635 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 156663 edits # 966 loaded in 0 seconds.
2010-10-05 14:37:16,638 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "12862^@^@^@^@^@^@^@^@"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Long.parseLong(Long.java:419)
        at java.lang.Long.parseLong(Long.java:468)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
        ...

I tried removing the edits file altogether, but that failed with:
java.io.IOException: Edits file is not found

I tried a zero-length edits file, so there would at least be a file there,
but that resulted in an NPE:

2010-10-05 14:52:34,775 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 0 edits # 0 loaded in 0 seconds.
2010-10-05 14:52:34,776 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)

Most, if not all, of the files I noticed in the edits file are temporary
files that will be deleted once this thing gets back up and running anyway.
There is a closed ticket that might be related:
https://issues.apache.org/jira/browse/HDFS-686, but the version I'm using
seems to already include the fix for HDFS-686 (according to
http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/changes.html).

What do I have to do to get back up and running?

Thank you for your help, 

Matthew

