hadoop-common-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Please help! Corrupt fsimage?
Date Wed, 07 Jul 2010 15:31:46 GMT
What Alex said, and it also really looks like
https://issues.apache.org/jira/browse/HDFS-1024, judging from my
experience with that issue.

J-D

On Wed, Jul 7, 2010 at 8:07 AM, Alex Loddengaard <alex@cloudera.com> wrote:

> Hi Peter,
>
> The edits.new file is used while the edits and fsimage are pulled by the
> secondarynamenode.  Here's the process:
>
> 1) SNN pulls edits and fsimage
> 2) NN starts writing edits to edits.new
> 3) SNN sends new fsimage to NN
> 4) NN replaces its fsimage with the SNN fsimage
> 5) NN replaces edits with edits.new
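>
> The five steps above can be sketched as a toy shell simulation, with plain
> file copies in temp dirs standing in for what the real daemons do over
> HTTP (the file names mirror the actual on-disk layout; everything else,
> including the fake image contents, is illustrative):

```shell
NN=$(mktemp -d)   # stands in for dfs.name.dir/current on the namenode
SNN=$(mktemp -d)  # stands in for dfs.checkpoint.dir/current on the SNN
printf 'img-v1' > "$NN/fsimage"
printf 'ops'    > "$NN/edits"

cp "$NN/fsimage" "$NN/edits" "$SNN/"   # 1) SNN pulls edits and fsimage
: > "$NN/edits.new"                    # 2) NN starts logging to edits.new
printf 'img-v2' > "$SNN/fsimage"       # 3) SNN merges edits into the image...
cp "$SNN/fsimage" "$NN/fsimage.ckpt"   #    ...and sends the new fsimage to NN
mv "$NN/fsimage.ckpt" "$NN/fsimage"    # 4) NN adopts the SNN fsimage
mv "$NN/edits.new" "$NN/edits"         # 5) NN replaces edits with edits.new
```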
>
> Taking a different fsimage and trying to apply those edits to it certainly
> won't work.  Your best bet might be to take the 3-day-old fsimage with an
> empty edits file and delete edits.new.  But before you do any of this,
> make sure you completely back up every directory listed in dfs.name.dir
> and dfs.checkpoint.dir.  What are the timestamps on the fsimage files in
> each dfs.name.dir and dfs.checkpoint.dir?
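>
> One way to take that backup safely (the directories below are
> placeholders built in temp space so the sketch is self-contained; in
> practice, point NAME_DIR and CHECKPOINT_DIR at the real values of
> dfs.name.dir and dfs.checkpoint.dir from hdfs-site.xml, and stop the
> daemons first):

```shell
# Placeholder layout standing in for the real directories.
NAME_DIR=$(mktemp -d); CHECKPOINT_DIR=$(mktemp -d)
mkdir -p "$NAME_DIR/current" "$CHECKPOINT_DIR/current"
touch "$NAME_DIR/current/fsimage" "$NAME_DIR/current/edits" \
      "$NAME_DIR/current/edits.new" "$CHECKPOINT_DIR/current/fsimage"

# Full copy of both trees before touching anything.
BACKUP=$(mktemp -d)
cp -a "$NAME_DIR"       "$BACKUP/name_dir"
cp -a "$CHECKPOINT_DIR" "$BACKUP/checkpoint_dir"

# Compare the timestamps asked about above.
ls -l "$BACKUP/name_dir/current" "$BACKUP/checkpoint_dir/current"
```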
>
> Do the namenode and secondarynamenode have enough disk space?  Have you
> consulted the logs to learn why the SNN/NN didn't properly update the
> fsimage and edits log?
>
> Hope this helps.
>
> Alex
>
> On Wed, Jul 7, 2010 at 7:34 AM, Peter Falk <peter@bugsoft.nu> wrote:
>
> > Just a little update. We found a working fsimage that was just a couple
> > of days older than the corrupt one. We tried to replace the corrupt
> > fsimage with the working one, keeping the edits and edits.new files, in
> > the hope that the latest edits would still be applied. However, when
> > starting the namenode, the following error message appears. Any
> > thoughts, ideas, or hints on how to continue? Edit the edits files
> > somehow?
> >
> > TIA,
> > Peter
> >
> > 2010-07-07 16:21:10,312 INFO org.apache.hadoop.hdfs.server.common.Storage:
> > Number of files = 28372
> > 2010-07-07 16:21:11,162 INFO org.apache.hadoop.hdfs.server.common.Storage:
> > Number of files under construction = 8
> > 2010-07-07 16:21:11,164 INFO org.apache.hadoop.hdfs.server.common.Storage:
> > Image file of size 3315887 loaded in 0 seconds.
> > 2010-07-07 16:21:11,164 DEBUG
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 9:
> > /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 numblocks : 1
> > clientHolder  clientMachine
> > 2010-07-07 16:21:11,164 DEBUG org.apache.hadoop.hdfs.StateChange: DIR*
> > FSDirectory.unprotectedDelete: failed to remove
> > /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 because it
> > does not exist
> > 2010-07-07 16:21:11,164 ERROR
> > org.apache.hadoop.hdfs.server.namenode.NameNode:
> > java.lang.NullPointerException
> >        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1006)
> >        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:982)
> >        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:194)
> >        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:615)
> >        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
> >        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
> >        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
> >        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
> >        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
> >        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
> >        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
> >        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
> >        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
> >        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
> >
> > 2010-07-07 16:21:11,165 INFO
> > org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> > /************************************************************
> > SHUTDOWN_MSG: Shutting down NameNode at fanta/192.168.10.53
> > ************************************************************/
> >
> >
> > On Wed, Jul 7, 2010 at 14:46, Peter Falk <peter@bugsoft.nu> wrote:
> >
> > > Hi,
> > >
> > > After a restart of our live cluster today, the name node fails to
> > > start with the log message seen below. There is a big file called
> > > edits.new in the "current" folder that seems to be the only one that
> > > has received changes recently (no changes to the edits or the fsimage
> > > for over a month). Is that normal?
> > >
> > > The last change to the edits.new file was right before shutting down
> > > the cluster. It seems like the shutdown was unable to store valid
> > > fsimage, edits, and edits.new files. The secondary name node's copy
> > > does not include the edits.new file, only edits and fsimage, which are
> > > identical to the name node's versions. So no help there.
> > >
> > > Would appreciate any help in understanding what could have gone wrong.
> > > The shutdown seemed to complete just fine, without any error message.
> > > Is there any way to recreate the image from the data, or any other way
> > > to save our production data?
> > >
> > > Sincerely,
> > > Peter
> > >
> > > 2010-07-07 14:30:26,949 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
> > > Initializing RPC Metrics with hostName=NameNode, port=9000
> > > 2010-07-07 14:30:26,960 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> > > Initializing JVM Metrics with processName=NameNode, sessionId=null
> > > 2010-07-07 14:30:27,019 DEBUG
> > > org.apache.hadoop.security.UserGroupInformation: Unix Login: hbase,hbase
> > > 2010-07-07 14:30:27,149 ERROR
> > > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
> > > initialization failed.
> > > java.io.EOFException
> > >         at java.io.DataInputStream.readShort(DataInputStream.java:298)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
> > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
> > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
> > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
> > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
> > > 2010-07-07 14:30:27,150 INFO org.apache.hadoop.ipc.Server: Stopping
> > > server on 9000
> > > 2010-07-07 14:30:27,151 ERROR
> > > org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
> > >         at java.io.DataInputStream.readShort(DataInputStream.java:298)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
> > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
> > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
> > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
> > >         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
> > >
> >
>
