hadoop-hdfs-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: NameNode crash - cannot start dfs - need help
Date Tue, 05 Oct 2010 17:09:17 GMT
On Tue, Oct 5, 2010 at 9:58 AM, Matthew LeMieux <mdl@mlogiciels.com> wrote:

> Thank you Todd.
>
> It does indeed seem like a challenge to find a record boundary, but here is
> how I did it, in case others are interested in doing the same.
>
> It looks like that value (0xFF) is referenced as OP_INVALID in the source
> file:
> [hadoop-dist]/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java.
>
> Every record begins with an op code that identifies the record type.  The op
> codes are in the range [0,14] (inclusive), except for OP_INVALID.  Each
> record type (based on op code) appears to have a different format.
> Additionally, it seems that the code for each record type has several code
> paths to support different versions of HDFS.
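
For reference (an editorial addition, not part of the original thread): the op
codes mentioned in this discussion appear to have the following values in the
0.20-era FSEditLog.java; double-check against your own source tree before
relying on them.

    # Assumed op code values from the 0.20-era FSEditLog.java (verify
    # against your own tree); OP_INVALID is -1, i.e. 0xFF as an unsigned byte.
    OP_ADD     = 0x00  # create a file (handled by the same switch case as OP_CLOSE)
    OP_MKDIR   = 0x03  # create a directory
    OP_CLOSE   = 0x09  # finalize a file
    OP_INVALID = 0xFF  # marks the end of the valid portion of the log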
>
> I looked at the error message and found the line number of the exception
> within the switch statement in the code (in this case, line 563).  That told
> me that I was looking for an op code of either 0x00 or 0x09.  I noticed that
> records on this particular code path have a format that looks like this:
> [# bytes: name]
>
> [1: op code][4: int length][2: file system path length][?: file system path text]
>
> All I had to do was find a filesystem path, and look 7 bytes before it
> started.  If the op code was a 0x00 or 0x09, then this was a candidate
> record.
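
A rough sketch of that search (an editorial addition, not from the thread; the
path string to look for is a placeholder, and the edits file location is taken
from the log lines below):

    # Scan a binary edits file for a known filesystem path, then check
    # whether the byte 7 positions before it (1 op code byte + 4 length
    # bytes + 2 path-length bytes) is one of the candidate op codes.
    import sys

    EDITS = sys.argv[1] if len(sys.argv) > 1 else "/mnt/name/current/edits"
    NEEDLE = b"/user/some/known/path"   # hypothetical path to search for
    CANDIDATE_OPS = {0x00, 0x09}        # OP_ADD / OP_CLOSE in this case

    data = open(EDITS, "rb").read()
    start = 0
    while True:
        i = data.find(NEEDLE, start)
        if i == -1:
            break
        op_offset = i - 7
        if op_offset >= 0 and data[op_offset] in CANDIDATE_OPS:
            print("candidate record at offset %d (op 0x%02x)"
                  % (op_offset, data[op_offset]))
        start = i + 1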
>
> It would have been easier to just search for something from the error
> message (e.g. "12862" for me) to find candidate records, but in my case that
> string appeared in almost every record.  It would also have been easier to
> just search for instances of the op code itself, but in my case one of the op
> codes (0x00) appears too often in the data to make that useful.  If your op
> code is 0x03, for example, you will probably have a much easier time of it
> than I did.
>
> I was able to quickly find record boundaries and replace the op code with
> 0xFF.  After a few records I was back to the NullPointerException that I was
> getting with a zero-length edits file:
>
> 2010-10-05 16:47:39,670 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Edits file /mnt/name/current/edits of size 157037 edits # 959 loaded in 0
> seconds.
> 2010-10-05 16:47:39,671 ERROR
> org.apache.hadoop.hdfs.server.namenode.NameNode:
> java.lang.NullPointerException
>         at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:627)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:830)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:378)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:92)
>
> One hurdle down, how do I get past the next one?
>

It's unclear whether you're getting the error in "edits" or "edits.new".
From the above, I'm guessing "edits" is corrupt, so once you fixed the error
there (by truncating a few edits from the end), the later edits in
"edits.new" failed because they depended on a path that should have been
created by "edits".

>
> (BTW, what if I didn't want to keep my recent edits and just wanted to
> start up the namenode?  This is currently expensive downtime; I'd rather
> lose a small amount of data and be up and running than prolong the outage.)
>

If you really want to do this, you can remove "edits.new" and replace
"edits" with a file containing the hex bytes 0xffffffeeff, I believe (the
edits header plus OP_INVALID).
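
A minimal sketch of that replacement (an editorial addition; it assumes the
header Todd quotes, 0xFFFFFFEE, i.e. layout version -18, matches the
layoutVersion recorded in this cluster's VERSION file; verify that first, and
move the original edits and edits.new aside rather than deleting them):

    # Write an "empty" edits file: the 4-byte layout-version header
    # followed by a single OP_INVALID (0xFF) byte.
    with open("/mnt/name/current/edits", "wb") as f:
        f.write(bytes([0xFF, 0xFF, 0xFF, 0xEE, 0xFF]))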

-Todd

On Oct 5, 2010, at 8:42 AM, Todd Lipcon wrote:

>
> Hi Matt,
>
> If you want to keep your recent edits, you'll have to place an 0xFF at the
> beginning of the most recent edit entry in the edit log. It's a bit tough to
> find these boundaries, but you can try applying this patch and rebuilding:
>
> https://issues.apache.org/jira/browse/hdfs-1378
>
> This will tell you the offset of the broken entry ("recent opcodes") and
> you can put an 0xff there to tie off the file before the corrupt entry.
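
For what it's worth, tying the file off at a reported offset is a one-byte
overwrite; a sketch (an editorial addition, with a placeholder offset):

    # Overwrite the first byte of the corrupt entry with OP_INVALID (0xFF)
    # so that loadFSEdits stops reading just before it.  Replace the offset
    # with the value reported by the patched NameNode.
    offset = 123456
    with open("/mnt/name/current/edits", "r+b") as f:
        f.seek(offset)
        f.write(b"\xff")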
>
> -Todd
>
>
> On Tue, Oct 5, 2010 at 8:16 AM, Matthew LeMieux <mdl@mlogiciels.com>wrote:
>
>> The namenode on an otherwise very stable HDFS cluster crashed recently.
>> The filesystem filled up on the namenode host, which I assume is what
>> caused the crash.  That problem has been fixed, but I cannot get the
>> namenode to restart.  I am using CDH3b2 (hadoop-0.20.2+320).
>>
>> The error is this:
>>
>> 2010-10-05 14:46:55,989 INFO org.apache.hadoop.hdfs.server.common.Storage:
>> Edits file /mnt/name/current/edits of size 157037 edits # 969 loaded in 0
>> seconds.
>> 2010-10-05 14:46:55,992 ERROR
>> org.apache.hadoop.hdfs.server.namenode.NameNode:
>> java.lang.NumberFormatException: For input string: "12862^@^@^@^@^@^@^@^@"
>>         at
>> java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>         at java.lang.Long.parseLong(Long.java:419)
>>         at java.lang.Long.parseLong(Long.java:468)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>>         ...
>>
>> This page (http://wiki.apache.org/hadoop/TroubleShooting) recommends
>> editing the edits file with a hex editor, but does not explain where the
>> record boundaries are.  The wiki describes a different exception, but the
>> cause seemed similar: a corrupt edits file.  I tried removing a line at a
>> time, but the error continues, only with a smaller size and edits count:
>>
>> 2010-10-05 14:37:16,635 INFO org.apache.hadoop.hdfs.server.common.Storage:
>> Edits file /mnt/name/current/edits of size 156663 edits # 966 loaded in 0
>> seconds.
>> 2010-10-05 14:37:16,638 ERROR
>> org.apache.hadoop.hdfs.server.namenode.NameNode:
>> java.lang.NumberFormatException: For input string: "12862^@^@^@^@^@^@^@^@"
>>         at
>> java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>         at java.lang.Long.parseLong(Long.java:419)
>>         at java.lang.Long.parseLong(Long.java:468)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>>         ...
>>
>> I tried removing the edits file altogether, but that failed
>> with: java.io.IOException: Edits file is not found
>>
>> I tried with a zero-length edits file, so it would at least have a file
>> there, but that results in an NPE:
>>
>> 2010-10-05 14:52:34,775 INFO org.apache.hadoop.hdfs.server.common.Storage:
>> Edits file /mnt/name/current/edits of size 0 edits # 0 loaded in 0 seconds.
>> 2010-10-05 14:52:34,776 ERROR
>> org.apache.hadoop.hdfs.server.namenode.NameNode:
>> java.lang.NullPointerException
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)
>>
>>
>> Most, if not all, of the files I noticed in the edits file are temporary
>> files that will be deleted once this thing gets back up and running anyway.
>> There is a closed ticket that might be related:
>> https://issues.apache.org/jira/browse/HDFS-686, but the version I'm
>> using already seems to include the HDFS-686 fix (according to
>> http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/changes.html).
>>
>> What do I have to do to get back up and running?
>>
>> Thank you for your help,
>>
>> Matthew
>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
