From: Ayon Sinha
To: hdfs-user@hadoop.apache.org
Date: Tue, 5 Oct 2010 11:32:45 -0700 (PDT)
Subject: Re: NameNode crash - cannot start dfs - need help

Hi Matthew,
Congratulations. Having HDFS back is quite a relief, and you were lucky not to lose any files/blocks.
Another thing I ended up doing was to decommission the namenode machine from being a data node. That is what had caused the namenode to run out of disk space.
 
-Ayon



From: Matthew LeMieux <mdl@mlogiciels.com>
To: hdfs-user@hadoop.apache.org
Sent: Tue, October 5, 2010 11:25:57 AM
Subject: Re: NameNode crash - cannot start dfs - need help

Thank you Ayon, Allen and Todd for your suggestions. 

I was tempted to try to find the offending records in edits.new, but opted for simply moving the file instead.  I kept the recently edited edits file in place. 

The namenode started up this time with no exceptions and appears to be running well;  hadoop fsck / reports a healthy filesystem. 

Thank you, 

Matthew

On Oct 5, 2010, at 10:09 AM, Todd Lipcon wrote:

On Tue, Oct 5, 2010 at 9:58 AM, Matthew LeMieux <mdl@mlogiciels.com> wrote:
Thank you Todd. 

It does indeed seem like a challenge to find a record boundary, but I wanted to do it, so here is how I did it, in case others are interested in doing the same.

It looks like that value (0xFF) is referenced as OP_INVALID in the source file: [hadoop-dist]/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java.

Every record begins with an op code that describes the record. The op codes are in the range [0,14] (inclusive), except for OP_INVALID. Each record type (based on op code) appears to have a different format. Additionally, it seems that the code for each record type has several code paths to support different versions of HDFS.

 I looked in the error messages, and found the line number of the exception within the switch statement in the code (in this case, line 563).  That told me that I was looking for an op code of either 0x00 or 0x09.  I noticed that this particular code path had a record type that looked like this: 
[# bytes: name]

[1:op code][4:int length][2:file system path length][?:file system path text]

All I had to do was find a filesystem path, and look 7 bytes before it started.  If the op code was a 0x00 or 0x09, then this was a candidate record. 
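
(For anyone scripting this later, here is a rough sketch of that scan in Python. It only encodes the description above -- the [1:op][4:int][2:path length] layout, the 0x00/0x09 candidate op codes, and the 7-byte offset -- and the "/user/" prefix is just an example; substitute a prefix you know appears in your own paths, and work on a copy of the edits file.)

import struct
import sys

CANDIDATE_OPS = (0x00, 0x09)      # the op codes implicated at FSEditLog.java:563
PATH_PREFIX = b"/user/"           # example only; use a prefix that occurs in your paths

data = open(sys.argv[1], "rb").read()     # e.g. /mnt/name/current/edits

pos = data.find(PATH_PREFIX)
while pos != -1:
    start = pos - 7               # [1:op code][4:int length][2:path length] precede the path
    if start >= 0 and ord(data[start:start + 1]) in CANDIDATE_OPS:
        # sanity check: the 2-byte big-endian path length should at least cover the prefix
        (plen,) = struct.unpack(">H", data[start + 5:start + 7])
        if plen >= len(PATH_PREFIX):
            print("candidate record at offset %d (op 0x%02x, path length %d)"
                  % (start, ord(data[start:start + 1]), plen))
    pos = data.find(PATH_PREFIX, pos + 1)

(Run it against a copy of the edits file, e.g. "python scan_edits.py /mnt/name/current/edits"; the script name is arbitrary.)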

It would have been easier to just search for something from the error message (i.e. "12862" for me) to find candidate records, but in my case that was in almost every record. Additionally, it would also have been easier to just search for instances of the op code, but in my case one of the op codes (0x00) appears too often in the data to make that useful. If your op code is 0x03, for example, you will probably have a much easier time of it than I did.

I was able to successfully and quickly find record boundaries and replace the op code with 0xFF. After invalidating a few records this way, I was back to the NPE that I was getting with a zero-length edits file:

2010-10-05 16:47:39,670 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 959 loaded in 0 seconds.
2010-10-05 16:47:39,671 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:627)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:830)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:378)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:92)

One hurdle down, how do I get past the next one?

It's unclear whether you're getting the error in "edits" or "edits.new". From the above, I'm guessing maybe "edits" is corrupt, so when you fixed the error there (by truncating a few edits from the end), the later edits in edits.new failed because they depended on a path that should have been created by "edits".

(BTW, what if I didn't want to keep my recent edits, and just wanted to start up the namenode? This is currently expensive downtime; I'd rather lose a small amount of data and be up and running than continue the downtime.)

If you really want to do this, you can remove "edits.new" and replace "edits" with a file containing hex 0xffffffeeff, I believe (the edits header plus OP_INVALID).
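
(A sketch of that, in the same vein as the Python snippet earlier in the thread. The 0xFFFFFFEE value is the 4-byte edits header, i.e. the layout version Todd quotes; it is worth checking it against the first four bytes of your existing edits file before writing anything.)

with open("/mnt/name/current/edits", "wb") as f:
    f.write(b"\xff\xff\xff\xee\xff")     # 4-byte edits header (layout version) + OP_INVALID (0xFF)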

-Todd
 
On Oct 5, 2010, at 8:42 AM, Todd Lipcon wrote:

Hi Matt,

If you want to keep your recent edits, you'll have to place an 0xFF at the beginning of the most recent edit entry in the edit log. It's a bit tough to find these boundaries, but you can try applying this patch and rebuilding:

https://issues.apache.org/jira/browse/hdfs-1378

This will tell you the offset of the broken entry ("recent opcodes") and you can put an 0xff there to tie off the file before the corrupt entry.
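
(Concretely, once the patch reports the offset of the broken entry, tying off the file is a one-byte overwrite; a sketch, with the offset value as a placeholder and assuming you work on a backed-up copy:)

offset = 0x1234                   # placeholder: offset of the broken entry reported above
with open("/mnt/name/current/edits", "r+b") as f:
    f.seek(offset)
    f.write(b"\xff")              # OP_INVALID: loading stops before the corrupt entry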

-Todd


On Tue, Oct 5, 2010 at 8:16 AM, Matthew LeMieux <mdl@mlogiciels.com> wrote:
The namenode on an otherwise very stable HDFS cluster crashed recently.  The filesystem filled up on the name node, which I assume is what caused the crash.    The problem has been fixed, but I cannot get the namenode to restart.  I am using version CDH3b2  (hadoop-0.20.2+320). 

The error is this: 

2010-10-05 14:46:55,989 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 969 loaded in 0 seconds.
2010-10-05 14:46:55,992 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "12862^@^@^@^@^@^@^@^@"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Long.parseLong(Long.java:419)
        at java.lang.Long.parseLong(Long.java:468)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
        ...

This page (http://wiki.apache.org/hadoop/TroubleShooting) recommends editing the edits file with a hex editor, but does not explain where the record boundaries are. It is a different exception, but it seemed to have a similar cause: the edits file. I tried removing a line at a time, but the error continues, only with a smaller size and edits #:

2010-10-05 14:37:16,635 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 156663 edits # 966 loaded in 0 seconds.
2010-10-05 14:37:16,638 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "12862^@^@^@^@^@^@^@^@"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Long.parseLong(Long.java:419)
        at java.lang.Long.parseLong(Long.java:468)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
        ...

I tried removing the edits file altogether, but that failed with: java.io.IOException: Edits file is not found

I tried with a zero-length edits file, so it would at least have a file there, but that resulted in an NPE:

2010-10-05 14:52:34,775 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 0 edits # 0 loaded in 0 seconds.
2010-10-05 14:52:34,776 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)


Most, if not all, of the files I noticed in the edits file are temporary files that will be deleted once this thing gets back up and running anyway. There is a closed ticket that might be related: https://issues.apache.org/jira/browse/HDFS-686, but the version I'm using seems to already have HDFS-686 (according to http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/changes.html).

What do I have to do to get back up and running?

Thank you for your help, 

Matthew





--
Todd Lipcon
Software Engineer, Cloudera




--
Todd Lipcon
Software Engineer, Cloudera

