hadoop-hdfs-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: NameNode crash - cannot start dfs - need help
Date Tue, 05 Oct 2010 15:42:11 GMT
Hi Matt,

If you want to keep your recent edits, you'll have to place a 0xFF byte at
the beginning of the first corrupt entry in the edit log. It's a bit tough
to find these entry boundaries by hand, but you can try applying this patch
and rebuilding:

https://issues.apache.org/jira/browse/hdfs-1378

This will log the offset of the broken entry ("recent opcodes") and you
can put a 0xFF at that offset to tie off the file before the corrupt entry.
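Once you have the offset, the fix itself is a one-byte write; a rough sketch
(the function name and the example path/offset are placeholders, and it keeps
a backup first since this is destructive):

```python
# Sketch only: tie off an HDFS edits file before a corrupt entry by
# writing the OP_INVALID opcode byte (0xFF) at the offset reported by
# the patched NameNode. The loader stops reading at OP_INVALID.
import shutil

def tie_off_edits(path, offset):
    """Back up the edits file, then write 0xFF at `offset` so
    loadFSEdits stops before the corrupt entry."""
    shutil.copyfile(path, path + ".bak")  # always keep a backup
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(b"\xff")

# e.g. tie_off_edits("/mnt/name/current/edits", 157000)
```

Every edit after that offset is lost, so only do this once you're sure the
offset points at the first broken entry.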

-Todd


On Tue, Oct 5, 2010 at 8:16 AM, Matthew LeMieux <mdl@mlogiciels.com> wrote:

> The namenode on an otherwise very stable HDFS cluster crashed recently.
> The disk filled up on the namenode machine, which I assume is what caused
> the crash. That problem has been fixed, but I cannot get the namenode to
> restart. I am using version CDH3b2 (hadoop-0.20.2+320).
>
> The error is this:
>
> 2010-10-05 14:46:55,989 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Edits file /mnt/name/current/edits of size 157037 edits # 969 loaded in 0
> seconds.
> 2010-10-05 14:46:55,992 ERROR
> org.apache.hadoop.hdfs.server.namenode.NameNode:
> java.lang.NumberFormatException: For input string: "12862^@^@^@^@^@^@^@^@"
>         at
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>         at java.lang.Long.parseLong(Long.java:419)
>         at java.lang.Long.parseLong(Long.java:468)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>         ...
>
> This page (http://wiki.apache.org/hadoop/TroubleShooting) recommends
> editing the edits file with a hex editor, but does not explain where the
> record boundaries are. The exception there is different, but the cause
> seemed similar: a corrupt edits file. I tried removing a line at a time,
> but the error persists, only with a smaller size and edits #:
>
> 2010-10-05 14:37:16,635 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Edits file /mnt/name/current/edits of size 156663 edits # 966 loaded in 0
> seconds.
> 2010-10-05 14:37:16,638 ERROR
> org.apache.hadoop.hdfs.server.namenode.NameNode:
> java.lang.NumberFormatException: For input string: "12862^@^@^@^@^@^@^@^@"
>         at
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>         at java.lang.Long.parseLong(Long.java:419)
>         at java.lang.Long.parseLong(Long.java:468)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>         ...
>
> I tried removing the edits file altogether, but that failed
> with: java.io.IOException: Edits file is not found
>
> I tried with a zero length edits file, so it would at least have a file
> there, but that results in an NPE:
>
> 2010-10-05 14:52:34,775 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Edits file /mnt/name/current/edits of size 0 edits # 0 loaded in 0 seconds.
> 2010-10-05 14:52:34,776 ERROR
> org.apache.hadoop.hdfs.server.namenode.NameNode:
> java.lang.NullPointerException
>         at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)
>
>
> Most, if not all, of the files I noticed in the edits file are temporary
> files that will be deleted once this thing gets back up and running anyway.
>  There is a closed ticket that might be related:
> https://issues.apache.org/jira/browse/HDFS-686 ,  but the version I'm
> using seems to already have HDFS-686 (according to
> http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/changes.html)
>
> What do I have to do to get back up and running?
>
> Thank you for your help,
>
> Matthew
>


-- 
Todd Lipcon
Software Engineer, Cloudera
