From: Todd Lipcon
Date: Tue, 5 Oct 2010 10:09:17 -0700
Subject: Re: NameNode crash - cannot start dfs - need help
To: hdfs-user@hadoop.apache.org

On Tue, Oct 5, 2010 at 9:58 AM, Matthew LeMieux <mdl@mlogiciels.com> wrote:

> Thank you, Todd.
>
> It does indeed seem like a challenge to find a record boundary, but I
> wanted to do it anyway, so here is how I did it, in case others are
> interested in doing the same.
>
> It looks like that value (0xFF) is referenced as OP_INVALID in the source
> file [hadoop-dist]/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java.
>
> Every record begins with an op code that describes the record. The op
> codes are in the range [0,14] (inclusive), except for OP_INVALID. Each
> record type (based on op code) appears to have a different format, and the
> code for each record type has several code paths to support different
> versions of HDFS.
>
> I looked at the error messages and found the line number of the exception
> within the switch statement in the code (in this case, line 563). That
> told me I was looking for an op code of either 0x00 or 0x09. I noticed
> that this particular code path had a record type that looked like this
> ([# bytes: name]):
>
> [1:op code][4:int length][2:file system path length][?:file system path text]
>
> All I had to do was find a filesystem path and look 7 bytes before it
> started. If the op code there was 0x00 or 0x09, it was a candidate record.
>
> It would have been easier to just search for something from the error
> message (i.e., "12862" for me) to find candidate records, but in my case
> that was in almost every record. It would also have been easier to just
> search for instances of the op code, but in my case one of the op codes
> (0x00) appears too often in the data to make that useful. If your op code
> is 0x03, for example, you will probably have a much easier time of it
> than I did.
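[A rough sketch of that scan, assuming the record layout quoted above
([1:op code][4:int length][2:path length][path text]) with a big-endian
2-byte path length, as java.io.DataOutputStream would write it. The class
name and command-line arguments are illustrative, not from the HDFS source:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class EditsScan {
        public static void main(String[] args) throws IOException {
            // args[0] = path to the edits file, args[1] = a path string to hunt for
            RandomAccessFile raf = new RandomAccessFile(args[0], "r");
            byte[] edits = new byte[(int) raf.length()];
            raf.readFully(edits);
            raf.close();
            byte[] needle = args[1].getBytes("US-ASCII");
            for (int i = 7; i + needle.length <= edits.length; i++) {
                if (!startsWith(edits, i, needle)) continue;
                // Assumed layout: op code at i-7, 4-byte int at i-6..i-3,
                // 2-byte big-endian path length at i-2..i-1, path text at i.
                int pathLen = ((edits[i - 2] & 0xFF) << 8) | (edits[i - 1] & 0xFF);
                int op = edits[i - 7] & 0xFF;
                if ((op == 0x00 || op == 0x09) && pathLen >= needle.length) {
                    System.out.printf("candidate record at offset 0x%X (op 0x%02X, path length %d)%n",
                            i - 7, op, pathLen);
                }
            }
        }

        private static boolean startsWith(byte[] buf, int off, byte[] needle) {
            for (int j = 0; j < needle.length; j++) {
                if (buf[off + j] != needle[j]) return false;
            }
            return true;
        }
    }

Tying off the file at a candidate offset is then a single-byte patch: open
the file with RandomAccessFile in "rw" mode, seek to the offset, and write
0xFF, the OP_INVALID marker.]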
> I was able to quickly find record boundaries and replace the op code with
> 0xff. After a few records, I was back to the NPE that I was getting with
> a zero-length edits file:
>
> 2010-10-05 16:47:39,670 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Edits file /mnt/name/current/edits of size 157037 edits # 959 loaded in 0
> seconds.
> 2010-10-05 16:47:39,671 ERROR
> org.apache.hadoop.hdfs.server.namenode.NameNode:
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:627)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:830)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:378)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:92)
>
> One hurdle down; how do I get past the next one?

It's unclear whether you're getting the error in "edits" or "edits.new".
From the above, I'm guessing "edits" is corrupt, so when you fixed the
error there (by truncating a few edits from the end), the later edits in
"edits.new" failed because they depended on a path that should have been
created by "edits".

> (BTW, what if I didn't want to keep my recent edits, and just wanted to
> start up the namenode? This is currently expensive downtime; I'd rather
> lose a small amount of data and be up and running than continue the
> downtime.)

If you really want to do this, you can remove "edits.new" and replace
"edits" with a file containing hex 0xffffffeeff, I believe (edits header
plus OP_INVALID).

-Todd
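[A minimal sketch of that suggestion, assuming Todd's byte string is right:
0xffffffee would be the 4-byte layout version -18 that Hadoop 0.20 writes
at the head of an edits file, and 0xff is OP_INVALID. The class name is
illustrative, and the layout version should be verified against your own
build before relying on this:

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class WriteEmptyEdits {
        public static void main(String[] args) throws IOException {
            // args[0] = path of the replacement "edits" file to create
            DataOutputStream out =
                    new DataOutputStream(new FileOutputStream(args[0]));
            try {
                out.writeInt(-18);   // edits header: layout version, 0xFFFFFFEE
                out.writeByte(0xFF); // OP_INVALID terminator: no edits follow
            } finally {
                out.close();
            }
        }
    }

The result is the five bytes ff ff ff ee ff that Todd describes.]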
On Oct 5, 2010, at 8:42 AM, Todd Lipcon wrote:

> Hi Matt,
>
> If you want to keep your recent edits, you'll have to place an 0xFF at
> the beginning of the most recent edit entry in the edit log. It's a bit
> tough to find these boundaries, but you can try applying this patch and
> rebuilding:
>
> https://issues.apache.org/jira/browse/hdfs-1378
>
> This will tell you the offset of the broken entry ("recent opcodes") and
> you can put an 0xff there to tie off the file before the corrupt entry.
>
> -Todd
>
> On Tue, Oct 5, 2010 at 8:16 AM, Matthew LeMieux <mdl@mlogiciels.com> wrote:
>
>> The namenode on an otherwise very stable HDFS cluster crashed recently.
>> The filesystem filled up on the name node, which I assume is what caused
>> the crash. The problem has been fixed, but I cannot get the namenode to
>> restart. I am using version CDH3b2 (hadoop-0.20.2+320).
>>
>> The error is this:
>>
>> 2010-10-05 14:46:55,989 INFO org.apache.hadoop.hdfs.server.common.Storage:
>> Edits file /mnt/name/current/edits of size 157037 edits # 969 loaded in 0
>> seconds.
>> 2010-10-05 14:46:55,992 ERROR
>> org.apache.hadoop.hdfs.server.namenode.NameNode:
>> java.lang.NumberFormatException: For input string: "12862^@^@^@^@^@^@^@^@"
>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>         at java.lang.Long.parseLong(Long.java:419)
>>         at java.lang.Long.parseLong(Long.java:468)
>>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
>>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>>         ...
>>
>> This page (http://wiki.apache.org/hadoop/TroubleShooting) recommends
>> editing the edits file with a hex editor, but does not explain where the
>> record boundaries are. It is a different exception, but it seemed to have
>> a similar cause: the edits file. I tried removing a line at a time, but
>> the error continues, only with a smaller size and edits #:
>>
>> 2010-10-05 14:37:16,635 INFO org.apache.hadoop.hdfs.server.common.Storage:
>> Edits file /mnt/name/current/edits of size 156663 edits # 966 loaded in 0
>> seconds.
>> 2010-10-05 14:37:16,638 ERROR
>> org.apache.hadoop.hdfs.server.namenode.NameNode:
>> java.lang.NumberFormatException: For input string: "12862^@^@^@^@^@^@^@^@"
>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>         at java.lang.Long.parseLong(Long.java:419)
>>         at java.lang.Long.parseLong(Long.java:468)
>>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
>>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>>         ...
>>
>> I tried removing the edits file altogether, but that failed with:
>> java.io.IOException: Edits file is not found
>>
>> I tried with a zero-length edits file, so it would at least have a file
>> there, but that results in an NPE:
>>
>> 2010-10-05 14:52:34,775 INFO org.apache.hadoop.hdfs.server.common.Storage:
>> Edits file /mnt/name/current/edits of size 0 edits # 0 loaded in 0 seconds.
>> 2010-10-05 14:52:34,776 ERROR
>> org.apache.hadoop.hdfs.server.namenode.NameNode:
>> java.lang.NullPointerException
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)
>>
>> Most if not all of the files I noticed in the edits file are temporary
>> files that will be deleted once this thing gets back up and running
>> anyway. There is a closed ticket that might be related:
>> https://issues.apache.org/jira/browse/HDFS-686 , but the version I'm
>> using seems to already include the fix for HDFS-686 (according to
>> http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/changes.html)
>>
>> What do I have to do to get back up and running?
>>
>> Thank you for your help,
>>
>> Matthew
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera