hadoop-common-user mailing list archives

From "Christian Kunz" <ck...@yahoo-inc.com>
Subject RE: Corrupt DFS edits-file
Date Fri, 08 Dec 2006 23:21:46 GMT
FYI: there is an open issue for this:
HADOOP-745

-Christian 

-----Original Message-----
From: Dhruba Borthakur [mailto:dhruba@yahoo-inc.com] 
Sent: Friday, December 08, 2006 2:46 PM
To: hadoop-user@lucene.apache.org
Subject: RE: Corrupt DFS edits-file

Hi Albert and Espen,

To help debug this issue further, I have a few questions:
1. Did you have more than one directory in dfs.name.dir?
2. Was this a new cluster, or an existing cluster that was recently upgraded
to 0.9.0?
3. Did any abnormal NameNode restarts occur immediately before the problem
started?

To make it easier to recover from such corruption:
1. Would it help to make the fsimage/edits file ASCII, so that it can be
easily edited by hand?
2. Does it make sense for HDFS to automatically create a directory
equivalent to /lost+found? During EditLog processing, if the parent directory
of a file does not exist, the file could go into /lost+found.
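
A rough Python sketch of the /lost+found idea in item 2, for discussion
only. It models the namespace as a plain set of paths; `replay_create`, the
set-based namespace, and the flattened naming of rescued files are all
hypothetical and are not how the NameNode is implemented.

```python
import posixpath

LOST_FOUND = "/lost+found"  # hypothetical rescue directory


def replay_create(namespace, path):
    """Add `path` to `namespace` (a set of existing paths).

    If the parent directory is missing, park the entry under /lost+found
    instead of aborting the replay. Returns the path actually used.
    """
    parent = posixpath.dirname(path) or "/"
    if parent in namespace:
        namespace.add(path)
        return path
    # Parent is missing: flatten the original path into a single name
    # (an assumption made here so the rescued entry keeps its origin visible).
    rescued = posixpath.join(LOST_FOUND, path.lstrip("/").replace("/", "_"))
    namespace.add(LOST_FOUND)
    namespace.add(rescued)
    return rescued
```

With this approach a missing parent would produce a relocated entry rather
than a FileNotFoundException at startup.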

Thanks,
dhruba

-----Original Message-----
From: Albert Chern [mailto:albert.chern@gmail.com]
Sent: Friday, December 08, 2006 1:43 PM
To: hadoop-user@lucene.apache.org
Subject: Re: Corrupt DFS edits-file

This happened to me too, but the problem was that the OP_MKDIR instructions
were in the wrong order.  That is, in the edits file the parent directory was
created after the child.  Maybe you should check to see if that's the case.
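
The ordering check can be sketched in Python. This assumes the directory
paths from the OP_MKDIR records have already been extracted, in file order,
by some other means; the binary record layout is not parsed here.
`find_orphan_mkdirs` and its `existing` parameter are hypothetical names for
illustration, not part of Hadoop.

```python
import posixpath


def find_orphan_mkdirs(paths, existing=("/",)):
    """Return the paths whose parent did not exist yet when they were created.

    `paths` is the sequence of directories in the order they appear in the
    edits file; `existing` lists directories already present beforehand
    (for example, those loaded from the fsimage).
    """
    known = set(existing)
    orphans = []
    for path in paths:
        parent = posixpath.dirname(path.rstrip("/")) or "/"
        if parent not in known:
            orphans.append(path)
        # Record the path either way so only the first break in each
        # chain is reported, not every descendant below it.
        known.add(path)
    return orphans
```

For example, given "/user", "/user/trank/dotno", "/user/trank" in that
order, only "/user/trank/dotno" is reported, because its parent is created
one record too late.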

I fixed it by using vi in combination with xxd.  With the file open in vi,
press escape and issue the command ":%!xxd".  This converts the binary file
to a hex dump.  Then you can search through and perform the necessary edits.
I don't remember what the bytes were, but it was something like opcode,
length of path (in binary), path.  After you're done, issue the command
":%!xxd -r" to convert it back to binary.  Remember to back up your files
before you do this!
I also had to strip off a trailing byte that got tacked on for some reason
during the binary/hex conversion.
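
A small Python sketch for the trailing-byte cleanup. It assumes the edit only
reordered records, so the file size should be unchanged, and that the extra
byte is a newline appended when the editor wrote the file; the cause was not
confirmed in this thread. `drop_stray_trailing_newline` is a hypothetical
helper name.

```python
def drop_stray_trailing_newline(edited: bytes, original_size: int) -> bytes:
    """Remove one trailing newline if the file grew by exactly one byte.

    Only valid when the edit was not supposed to change the file length
    (for example, swapping two records). Otherwise the data is returned
    untouched.
    """
    if len(edited) == original_size + 1 and edited.endswith(b"\n"):
        return edited[:-1]
    return edited
```

Read the backup to get `original_size`, apply the function to the bytes of
the edited file, and write the result to a new file rather than over the
backup.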

Anyhow, this is a serious bug and could lead to data loss for a lot of
people.  I think we should report it.

On 12/8/06, Espen Amble Kolstad <espen@trank.no> wrote:
> Hi,
>
> I run hadoop-0.9-dev and my edits-file has become corrupt. When I try 
> to start the namenode I get the following error:
> 2006-12-08 20:38:57,431 ERROR dfs.NameNode -
> java.io.FileNotFoundException: Parent path does not exist:
> /user/trank/dotno/segments/20061208154235/parse_data/part-00000
>         at org.apache.hadoop.dfs.FSDirectory$INode.addNode(FSDirectory.java:186)
>         at org.apache.hadoop.dfs.FSDirectory.unprotectedMkdir(FSDirectory.java:714)
>         at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:254)
>         at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:191)
>         at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:320)
>         at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:226)
>         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:142)
>         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:134)
>         at org.apache.hadoop.dfs.NameNode.main(NameNode.java:585)
>
> I've grep'ed through my edits-file, to see what's wrong. It seems the 
> edits-file is missing an OP_MKDIR for 
> /user/trank/dotno/segments/20061208154235/parse_data.
>
> Is there a tool for fixing an edits-file, or to put in an OP_MKDIR ?
>
> - Espen
>


