hadoop-mapreduce-user mailing list archives

From Robert Dyer <rd...@iastate.edu>
Subject Re: Namenode failures
Date Sun, 17 Feb 2013 23:29:21 GMT
On Sun, Feb 17, 2013 at 5:08 PM, Harsh J <harsh@cloudera.com> wrote:

> Hi Robert,
>
> Are you by any chance adding files carrying unusual encoding?


I don't believe so.  The only files I push to HDFS are SequenceFiles (with
protobuf objects in them) and HBase's regions, which again are just protobuf
objects.  I don't use any special encodings in the protobufs.
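
For reference, those writes are essentially plain SequenceFile appends of
serialized protobuf bytes.  A minimal sketch of what that looks like (the
MyRecord message type and the paths below are placeholders, not the actual
job code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Append protobuf-serialized records to a SequenceFile on HDFS.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/data/records.seq"), Text.class, BytesWritable.class);
    try {
      MyRecord rec = MyRecord.newBuilder().setId(1).build();  // placeholder protobuf type
      writer.append(new Text("key-1"), new BytesWritable(rec.toByteArray()));
    } finally {
      writer.close();
    }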


> If it's
> possible, could you send us a bundle of the corrupted log set (all of the
> dfs.name.dir contents) so we can inspect what seems to be causing the
> corruption?
>

I can give the logs, dfs data dir(s), and 2nn dirs.

https://www.dropbox.com/s/heijq65pmb3esvd/hdfs-bug.tar.gz


> The only identified (but rarely occurring) bug around this part in
> 1.0.4 would be https://issues.apache.org/jira/browse/HDFS-4423. The
> other major corruption bug I know of is already fixed in your version,
> being https://issues.apache.org/jira/browse/HDFS-3652 specifically.
>
> We've not had this report from other users so having a reproduced file
> set (data not required) would be most helpful. If you have logs
> leading to the shutdown and crash as well, that'd be good to have too.
>
> P.S. How exactly are you shutting down the NN each time? A kill -9 or
> a regular SIGTERM shutdown?
>

I shut down the NN with 'bin/stop-dfs.sh'.
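
As far as I can tell that path is a regular SIGTERM, not a kill -9.  A rough
sketch of what the 1.x stop scripts boil down to (the pid-file location is
whatever HADOOP_PID_DIR points at on a given install; the path here is just
an assumption):

    # stop-dfs.sh -> hadoop-daemon.sh stop namenode, which does roughly:
    PID_FILE="$HADOOP_PID_DIR/hadoop-$USER-namenode.pid"
    if kill -0 "$(cat "$PID_FILE")" 2>/dev/null; then
      kill "$(cat "$PID_FILE")"   # plain kill sends SIGTERM, not SIGKILL
    fi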


>  On Mon, Feb 18, 2013 at 4:31 AM, Robert Dyer <rdyer@iastate.edu> wrote:
> > On Sun, Feb 17, 2013 at 4:41 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
> >>
> >> You can make use of the offline image viewer to diagnose
> >> the fsimage file.
> >
> >
> > Is this not included in the 1.0.x branch?  All of the documentation I
> > find for it says to run 'bin/hdfs oev' but I do not have a 'bin/hdfs'.
> >
> >>
> >> Warm Regards,
> >> Tariq
> >> https://mtariq.jux.com/
> >> cloudfront.blogspot.com
> >>
> >>
> >> On Mon, Feb 18, 2013 at 3:31 AM, Robert Dyer <psybers@gmail.com> wrote:
> >>>
> >>> It just happened again.  This was after a fresh format of HDFS/HBase
> >>> and I am attempting to re-import the (backed up) data.
> >>>
> >>>   http://pastebin.com/3fsWCNQY
> >>>
> >>> So now if I restart the namenode, I will lose data from the past 3
> >>> hours.
> >>>
> >>> What is causing this?  How can I avoid it in the future?  Is there an
> >>> easy way to monitor the checkpoints (other than a script grep'ing the
> >>> logs) to see when this happens?
> >>>
> >>>
> >>> On Sat, Feb 16, 2013 at 2:39 PM, Robert Dyer <psybers@gmail.com> wrote:
> >>>>
> >>>> Forgot to mention: Hadoop 1.0.4
> >>>>
> >>>>
> >>>> On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <psybers@gmail.com> wrote:
> >>>>>
> >>>>> I am at a bit of a wits' end here.  Every single time I restart the
> >>>>> namenode, I get this crash:
> >>>>>
> >>>>> 2013-02-16 14:32:42,616 INFO org.apache.hadoop.hdfs.server.common.Storage:
> >>>>> Image file of size 168058 loaded in 0 seconds.
> >>>>> 2013-02-16 14:32:42,618 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode:
> >>>>> java.lang.NullPointerException
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
> >>>>>
> >>>>> I am following best practices here, as far as I know.  I have the
> >>>>> namenode writing into 3 directories (2 local, 1 NFS).  All 3 of
> >>>>> these dirs have the exact same files in them.
> >>>>>
> >>>>> I also run a secondary checkpoint node.  This one appears to have
> >>>>> started failing a week ago, so checkpoints were *not* being done
> >>>>> since then.  Thus I can get the NN up and running, but with
> >>>>> week-old data!
> >>>>>
> >>>>> What is going on here?  Why does my NN data *always* wind up
> >>>>> causing this exception over time?  Is there some easy way to get
> >>>>> notified when the checkpointing starts to fail?
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>>
> >>>> Robert Dyer
> >>>> rdyer@iastate.edu
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>>
> >>> Robert Dyer
> >>> rdyer@iastate.edu
> >>
> >>
> >
> >
> >
> > --
> >
> > Robert Dyer
> > rdyer@iastate.edu
>
>
>
> --
> Harsh J
>
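
On the question above about an easy way to monitor the checkpoints: one
low-tech option is to alarm on the age of the newest fsimage in the
secondary's checkpoint directory.  A minimal cron-able sketch (the
fs.checkpoint.dir path and the mail alert below are placeholders for
whatever a given setup uses):

    #!/bin/sh
    # Warn if the 2NN has not written a fresh fsimage recently.
    CHECKPOINT_DIR=/data/2nn/current      # placeholder for fs.checkpoint.dir
    MAX_AGE_MIN=120                       # checkpoint period plus some slack
    if [ -z "$(find "$CHECKPOINT_DIR" -name fsimage -mmin -$MAX_AGE_MIN)" ]; then
      echo "No fresh checkpoint under $CHECKPOINT_DIR in the last $MAX_AGE_MIN minutes" \
        | mail -s "HDFS checkpoint stale" admin@example.com
    fi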



-- 

Robert Dyer
rdyer@iastate.edu
