hadoop-common-user mailing list archives

From Peter Falk <pe...@bugsoft.nu>
Subject Re: Please help! Corrupt fsimage?
Date Wed, 07 Jul 2010 18:03:55 GMT
Thanks for the information, Alex and Jean-Daniel! We have finally been able
to get the namenode to start, after patching the source code with the
attached patch. It is based on the HDFS-1002 patch, but modified and
extended to fix additional NPEs. It is made for Hadoop 0.20.1.

There seemed to be some corrupt edits and/or some missing files in the
fsimage that caused NPEs during startup, while the edits were being merged
into the fsimage. Hopefully the attached patch may be of some use to people
in similar situations. We have not run an fsck yet; we are waiting for a raw
copy of the datanode data to complete first. Let's hope that not too much
was lost...
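
For the curious, the core of the fix is a defensive null check while the
edits are replayed. The snippet below is only a simplified sketch of the
idea, not the patch itself (and logSkippedEdit is a made-up helper name,
used here for illustration):

    // In FSDirectory.addChild(): a corrupt edit can reference a parent
    // directory that no longer exists, so the resolved parent INode is
    // null and dereferencing it throws the NPE seen at startup.
    INodeDirectory parent = (INodeDirectory) pathComponents[pos - 1];
    if (parent == null) {
      logSkippedEdit(child);  // hypothetical helper: warn and move on
      return null;            // skip the bad edit instead of crashing
    }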

Sincerely,
Peter

On Wed, Jul 7, 2010 at 17:31, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> What Alex said, and also it really looks like
> https://issues.apache.org/jira/browse/HDFS-1024, based on my experience
> with that issue.
>
> J-D
>
> On Wed, Jul 7, 2010 at 8:07 AM, Alex Loddengaard <alex@cloudera.com>
> wrote:
>
> > Hi Peter,
> >
> > The edits.new file is created when the edits and fsimage are pulled to
> > the secondarynamenode.  Here's the process (sketched in code after the
> > list):
> >
> > 1) SNN pulls edits and fsimage
> > 2) NN starts writing edits to edits.new
> > 3) SNN sends new fsimage to NN
> > 4) NN replaces its fsimage with the SNN fsimage
> > 5) NN replaces edits with edits.new
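> >
> > In code form, the roll is roughly the following (a loose paraphrase of
> > the 0.20.x FSEditLog/FSImage method names, not an exact quote of the
> > sources):
> >
> >     editLog.rollEditLog();  // step 2: NN starts writing to edits.new
> >     // ... steps 1 and 3: SNN downloads fsimage and edits, merges
> >     // them in memory, and uploads the merged image back to the NN ...
> >     image.rollFSImage();    // steps 4-5: install the new fsimage and
> >                             // rename edits.new over edits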
> >
> > Certainly, taking a different fsimage and trying to apply edits to it
> > won't work.  Your best bet might be to take the 3-day-old fsimage with
> > an empty edits file and delete edits.new.  But before you do any of
> > this, make sure you completely back up every directory listed in
> > dfs.name.dir and dfs.checkpoint.dir.  What are the timestamps on the
> > fsimage files in each dfs.name.dir and dfs.checkpoint.dir?
> >
> > Do the namenode and secondarynamenode have enough disk space?  Have you
> > consulted the logs to learn why the SNN/NN didn't properly update the
> > fsimage and edits log?
> >
> > Hope this helps.
> >
> > Alex
> >
> > On Wed, Jul 7, 2010 at 7:34 AM, Peter Falk <peter@bugsoft.nu> wrote:
> >
> > > Just a little update. We found a working fsimage that was just a
> > > couple of days older than the corrupt one. We tried to replace the
> > > corrupt fsimage with the working one, and kept the edits and edits.new
> > > files, hoping that the latest edits would still be applied. However,
> > > when starting the namenode, the following error message appears. Any
> > > thoughts, ideas, or hints on how to continue? Edit the edits files
> > > somehow?
> > >
> > > TIA,
> > > Peter
> > >
> > > 2010-07-07 16:21:10,312 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 28372
> > > 2010-07-07 16:21:11,162 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 8
> > > 2010-07-07 16:21:11,164 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 3315887 loaded in 0 seconds.
> > > 2010-07-07 16:21:11,164 DEBUG org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 9: /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 numblocks : 1
> > > clientHolder  clientMachine
> > > 2010-07-07 16:21:11,164 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 because it does not exist
> > > 2010-07-07 16:21:11,164 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
> > >         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1006)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:982)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:194)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:615)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
> > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
> > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
> > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
> > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
> > >
> > > 2010-07-07 16:21:11,165 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> > > /************************************************************
> > > SHUTDOWN_MSG: Shutting down NameNode at fanta/192.168.10.53
> > > ************************************************************/
> > >
> > >
> > > On Wed, Jul 7, 2010 at 14:46, Peter Falk <peter@bugsoft.nu> wrote:
> > >
> > > > Hi,
> > > >
> > > > After a restart of our live cluster today, the name node fails to
> > > > start with the log message seen below. There is a big file called
> > > > edits.new in the "current" folder that seems to be the only one
> > > > that has received changes recently (no changes to the edits or the
> > > > fsimage for over a month). Is that normal?
> > > >
> > > > The last change to the edits.new file was right before shutting
> > > > down the cluster. It seems like the shutdown was unable to store
> > > > valid fsimage, edits, and edits.new files. The secondary name
> > > > node's image does not include the edits.new file, only edits and
> > > > fsimage, which are identical to the name node's versions. So no
> > > > help there.
> > > >
> > > > We would appreciate any help in understanding what could have gone
> > > > wrong. The shutdown seemed to complete just fine, without any error
> > > > message. Is there any way to recreate the image from the data, or
> > > > any other way to save our production data?
> > > >
> > > > Sincerely,
> > > > Peter
> > > >
> > > > 2010-07-07 14:30:26,949 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
> > > > 2010-07-07 14:30:26,960 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
> > > > 2010-07-07 14:30:27,019 DEBUG org.apache.hadoop.security.UserGroupInformation: Unix Login: hbase,hbase
> > > > 2010-07-07 14:30:27,149 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
> > > > java.io.EOFException
> > > >         at java.io.DataInputStream.readShort(DataInputStream.java:298)
> > > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
> > > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
> > > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
> > > >         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
> > > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
> > > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
> > > > 2010-07-07 14:30:27,150 INFO org.apache.hadoop.ipc.Server: Stopping server on 9000
> > > > 2010-07-07 14:30:27,151 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
> > > >         at java.io.DataInputStream.readShort(DataInputStream.java:298)
> > > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
> > > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
> > > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
> > > >         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
> > > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
> > > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
