hadoop-common-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: Please help! Corrupt fsimage?
Date Fri, 09 Jul 2010 14:35:58 GMT

I know this is a little late in the game...

You could have forced the cluster out of safe mode and then used fsck to move the
files with bad blocks out of the way. (See the help on fsck.)
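
For example (quoting the 0.20-era CLI from memory, so double-check against the
fsck help on your version):

  hadoop dfsadmin -safemode leave    # force the namenode out of safe mode
  hadoop fsck /                      # report which files have missing or corrupt blocks
  hadoop fsck / -move                # move the affected files to /lost+found
  hadoop fsck / -files -blocks -locations   # map bad blocks to datanodes (helps spot a bad disk)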

While that might not have helped recover lost data, it would have gotten your cloud back.

I would also find out where most of the corruption occurred. It sounds like you may have a
bad disk.

HTH

-Mike


> From: peter@bugsoft.nu
> Date: Wed, 7 Jul 2010 22:25:27 +0200
> Subject: Re: Please help! Corrupt fsimage?
> To: common-user@hadoop.apache.org
> 
> FYI, just a small update. After starting the data nodes, the block reporting
> ratio was only 68% and the name node never went out of safe mode.
> Apparently, too many edits were lost. We have resorted to formatting the
> cluster for now; we have backups of the most essential data and have started
> restoring them.
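> 
> (For reference, safe mode status can be polled with the command below; the
> threshold property is from the 0.20 defaults as I understand them, i.e. the
> name node leaves safe mode once dfs.safemode.threshold.pct of blocks, 0.999
> by default, have been reported:)
> 
>   hadoop dfsadmin -safemode get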
> 
> Of course, this data loss is very disappointing. We have kept copies of the
> datanode data, as well as the corrupt fsimage and edits. If anyone has any
> idea of how to restore the data, either by better merging the edits or by
> reconstructing the fsimage from the datanode data somehow, please let me
> know!
> 
> Time to get some sleep now, it has been a long day...
> 
> Sincerely,
> Peter
> 
> On Wed, Jul 7, 2010 at 20:03, Peter Falk <peter@bugsoft.nu> wrote:
> 
> > Thanks for the information Alex and Jean-Daniel! We have finally been able
> > to get the namenode to start, after patching the source code according to
> > the attached patch. It is based on the HDFS-1002 patch, but modified and
> > extended to fix additional NPEs. It is made for Hadoop 0.20.1.
> >
> > There seemed to be some corrupt edits and/or some missing files in the
> > fsimage that caused NPEs during startup while merging the edits into the
> > fsimage. Hopefully the attached patch will be of some use to people in
> > similar situations. We have not run an fsck yet; we are waiting for a raw
> > copy of the datanode data to complete first. Let's hope that not too much
> > was lost...
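> >
> > (By a raw copy I mean something along these lines; the directory is a
> > placeholder for the actual dfs.data.dir on each node:)
> >
> >   rsync -a /path/to/dfs.data.dir/ /backup/dfs.data.dir/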
> >
> > Sincerely,
> > Peter
> >
> >
> > On Wed, Jul 7, 2010 at 17:31, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >
> >> What Alex said, and also, having had experience with that issue, it really
> >> looks like https://issues.apache.org/jira/browse/HDFS-1024.
> >>
> >> J-D
> >>
> >> On Wed, Jul 7, 2010 at 8:07 AM, Alex Loddengaard <alex@cloudera.com>
> >> wrote:
> >>
> >> > Hi Peter,
> >> >
> >> > The edits.new file is used when the edits and fsimage are pulled to the
> >> > secondarynamenode.  Here's the process:
> >> >
> >> > 1) SNN pulls edits and fsimage
> >> > 2) NN starts writing edits to edits.new
> >> > 3) SNN sends new fsimage to NN
> >> > 4) NN replaces its fsimage with the SNN fsimage
> >> > 5) NN replaces edits with edits.new
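> >> >
> >> > If you want to watch that sequence happen, a checkpoint can also be
> >> > triggered by hand; if I remember the 0.20-era tooling correctly:
> >> >
> >> >   hadoop secondarynamenode -checkpoint force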
> >> >
> >> > Certainly taking a different fsimage and trying to apply edits to it
> >> > won't work.  Your best bet might be to take the 3-day-old fsimage with
> >> > an empty edits and delete edits.new.  But before you do any of this,
> >> > make sure you completely back up all values of dfs.name.dir and
> >> > dfs.checkpoint.dir.  What are the timestamps on the fsimage files in
> >> > each dfs.name.dir and dfs.checkpoint.dir?
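> >> >
> >> > Something like this is the kind of backup I mean (the paths are
> >> > placeholders; use the actual directories from your hdfs-site.xml):
> >> >
> >> >   tar czf /root/nn-meta-$(date +%F).tar.gz \
> >> >       /path/to/dfs.name.dir /path/to/dfs.checkpoint.dir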
> >> >
> >> > Do the namenode and secondarynamenode have enough disk space?  Have you
> >> > consulted the logs to learn why the SNN/NN didn't properly update the
> >> > fsimage and edits log?
> >> >
> >> > Hope this helps.
> >> >
> >> > Alex
> >> >
> >> > On Wed, Jul 7, 2010 at 7:34 AM, Peter Falk <peter@bugsoft.nu> wrote:
> >> >
> >> > > Just a little update. We found a working fsimage that was just a
> >> > > couple of days older than the corrupt one. We tried to replace the
> >> > > fsimage with the working one, and kept the edits and edits.new files,
> >> > > hoping that the latest edits would still be applied. However, when
> >> > > starting the namenode, the following error message appears. Any
> >> > > thoughts, ideas, or hints on how to continue? Edit the edits files
> >> > > somehow?
> >> > >
> >> > > TIA,
> >> > > Peter
> >> > >
> >> > > 2010-07-07 16:21:10,312 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 28372
> >> > > 2010-07-07 16:21:11,162 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 8
> >> > > 2010-07-07 16:21:11,164 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 3315887 loaded in 0 seconds.
> >> > > 2010-07-07 16:21:11,164 DEBUG org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 9: /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 numblocks : 1 clientHolder  clientMachine
> >> > > 2010-07-07 16:21:11,164 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 because it does not exist
> >> > > 2010-07-07 16:21:11,164 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
> >> > >         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1006)
> >> > >         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:982)
> >> > >         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:194)
> >> > >         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:615)
> >> > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
> >> > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
> >> > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
> >> > >         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
> >> > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
> >> > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
> >> > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
> >> > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
> >> > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
> >> > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
> >> > >
> >> > > 2010-07-07 16:21:11,165 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> >> > > /************************************************************
> >> > > SHUTDOWN_MSG: Shutting down NameNode at fanta/192.168.10.53
> >> > > ************************************************************/
> >> > >
> >> > >
> >> > > On Wed, Jul 7, 2010 at 14:46, Peter Falk <peter@bugsoft.nu> wrote:
> >> > >
> >> > > > Hi,
> >> > > >
> >> > > > After a restart of our live cluster today, the name node fails to
> >> > > > start with the log message seen below. There is a big file called
> >> > > > edits.new in the "current" folder that seems to be the only one that
> >> > > > has received changes recently (no changes to the edits or the
> >> > > > fsimage for over a month). Is that normal?
> >> > > >
> >> > > > The last change to the edits.new file was right before shutting down
> >> > > > the cluster. It seems like the shutdown was unable to store valid
> >> > > > fsimage, edits, and edits.new files. The secondary name node image
> >> > > > does not include the edits.new file, only edits and fsimage, which
> >> > > > are identical to the name node's versions. So no help from there.
> >> > > >
> >> > > > Would appreciate any help in understanding what could have gone
> >> > > > wrong. The shutdown seemed to complete just fine, without any error
> >> > > > message. Is there any way to recreate the image from the data, or
> >> > > > any other way to save our production data?
> >> > > >
> >> > > > Sincerely,
> >> > > > Peter
> >> > > >
> >> > > > 2010-07-07 14:30:26,949 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
> >> > > > 2010-07-07 14:30:26,960 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
> >> > > > 2010-07-07 14:30:27,019 DEBUG org.apache.hadoop.security.UserGroupInformation: Unix Login: hbase,hbase
> >> > > > 2010-07-07 14:30:27,149 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
> >> > > > java.io.EOFException
> >> > > >         at java.io.DataInputStream.readShort(DataInputStream.java:298)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
> >> > > > 2010-07-07 14:30:27,150 INFO org.apache.hadoop.ipc.Server: Stopping server on 9000
> >> > > > 2010-07-07 14:30:27,151 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
> >> > > >         at java.io.DataInputStream.readShort(DataInputStream.java:298)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
> >> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
> >> > > >
> >> > >
> >> >
> >>
> >
> >
 		 	   		  