hadoop-common-user mailing list archives

From Alex Loddengaard <a...@cloudera.com>
Subject Re: Please help! Corrupt fsimage?
Date Wed, 07 Jul 2010 15:07:49 GMT
Hi Peter,

The edits.new file is used while the edits and fsimage are pulled to the
secondarynamenode.  Here's the process:

1) SNN pulls edits and fsimage
2) NN starts writing edits to edits.new
3) SNN sends new fsimage to NN
4) NN replaces its fsimage with the SNN fsimage
5) NN replaces edits with edits.new
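
To make that file shuffle concrete, here is a toy Python sketch of the five
steps over plain local files.  The paths and the "merge" are made up for
illustration; this is not Hadoop's actual code, just the shape of what the NN
and SNN do with fsimage, edits and edits.new:

import shutil
from pathlib import Path

# Stand-ins for ${dfs.name.dir}/current and ${dfs.checkpoint.dir}/current.
# Example paths only.
name_dir = Path("/tmp/example-name/current")
ckpt_dir = Path("/tmp/example-checkpoint/current")

def checkpoint():
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    # 1) SNN pulls edits and fsimage from the NN
    shutil.copy2(name_dir / "fsimage", ckpt_dir / "fsimage")
    shutil.copy2(name_dir / "edits", ckpt_dir / "edits")
    # 2) NN starts writing new transactions to edits.new
    (name_dir / "edits.new").touch()
    # 3) SNN merges edits into fsimage and sends the result back
    #    (the real merge replays every edit record; here we just copy)
    shutil.copy2(ckpt_dir / "fsimage", ckpt_dir / "fsimage.ckpt")
    # 4) NN replaces its fsimage with the merged image from the SNN
    shutil.copy2(ckpt_dir / "fsimage.ckpt", name_dir / "fsimage")
    # 5) NN replaces edits with edits.new and keeps logging there
    (name_dir / "edits.new").replace(name_dir / "edits")

if __name__ == "__main__":
    name_dir.mkdir(parents=True, exist_ok=True)
    (name_dir / "fsimage").touch()
    (name_dir / "edits").touch()
    checkpoint()

The point is that edits.new only exists between steps 2 and 5; if a checkpoint
never completes, edits.new just keeps growing, which matches what you're
seeing.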

Certainly taking a different fsimage and trying to apply edits to it won't
work.  Your best bet might be to take the 3-day-old fsimage with an empty
edits and delete edits.new.  But before you do any of this, make sure you
completely back up every directory listed in dfs.name.dir and
dfs.checkpoint.dir.  What are the timestamps on the fsimage files in each
dfs.name.dir and dfs.checkpoint.dir?
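
If it helps, here is a minimal sketch of that backup step, assuming the
directories below stand in for whatever your dfs.name.dir and
dfs.checkpoint.dir actually point at (plain cp or rsync is just as good):

import shutil
import time
from pathlib import Path

# Example values only -- substitute every directory configured in
# dfs.name.dir (it can be a comma-separated list) and dfs.checkpoint.dir.
dirs_to_backup = [
    Path("/data/dfs/name"),           # dfs.name.dir
    Path("/data/dfs/namesecondary"),  # dfs.checkpoint.dir
]

stamp = time.strftime("%Y%m%d-%H%M%S")
backup_root = Path("/backup/namenode-" + stamp)

for d in dirs_to_backup:
    dest = backup_root / d.name
    # Copy the whole tree (current/ with fsimage, edits, edits.new, etc.),
    # preserving timestamps.
    shutil.copytree(d, dest)
    print("copied", d, "->", dest)

Only once every copy is safely in place would I start swapping fsimage and
edits files around.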

Do the namenode and secondarynamenode have enough disk space?  Have you
consulted the logs to learn why the SNN/NN didn't properly update the
fsimage and edits log?

Hope this helps.

Alex

On Wed, Jul 7, 2010 at 7:34 AM, Peter Falk <peter@bugsoft.nu> wrote:

> Just a little update. We found a working fsimage that was just a couple of
> days older than the corrupt one. We tried to replace the fsimage with the
> working one, and kept the edits and edits.new files, hoping that the latest
> edits would still be applied. However, when starting the namenode, the
> following error message appears. Any thoughts, ideas, or hints on how to
> continue? Edit the edits files somehow?
>
> TIA,
> Peter
>
> 2010-07-07 16:21:10,312 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 28372
> 2010-07-07 16:21:11,162 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 8
> 2010-07-07 16:21:11,164 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 3315887 loaded in 0 seconds.
> 2010-07-07 16:21:11,164 DEBUG org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 9: /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 numblocks : 1 clientHolder  clientMachine
> 2010-07-07 16:21:11,164 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 because it does not exist
> 2010-07-07 16:21:11,164 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1006)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:982)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:194)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:615)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
>
> 2010-07-07 16:21:11,165 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NameNode at fanta/192.168.10.53
> ************************************************************/
>
>
> On Wed, Jul 7, 2010 at 14:46, Peter Falk <peter@bugsoft.nu> wrote:
>
> > Hi,
> >
> > After a restart of our live cluster today, the name node fails to start
> > with the log message seen below. There is a big file called edits.new in
> > the "current" folder that seems to be the only one that has received
> > changes recently (no changes to the edits or the fsimage for over a
> > month). Is that normal?
> >
> > The last change to the edits.new file was right before shutting down the
> > cluster. It seems like the shutdown was unable to store valid fsimage,
> > edits, and edits.new files. The secondary name node's image does not
> > include an edits.new file, only edits and fsimage, which are identical to
> > the name node's versions. So no help from there.
> >
> > Would appreciate any help in understanding what could have gone wrong.
> > The shutdown seemed to complete just fine, without any error message. Is
> > there any way to recreate the image from the data, or any other way to
> > save our production data?
> >
> > Sincerely,
> > Peter
> >
> > 2010-07-07 14:30:26,949 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
> > 2010-07-07 14:30:26,960 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
> > 2010-07-07 14:30:27,019 DEBUG org.apache.hadoop.security.UserGroupInformation: Unix Login: hbase,hbase
> > 2010-07-07 14:30:27,149 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
> > java.io.EOFException
> >         at java.io.DataInputStream.readShort(DataInputStream.java:298)
> >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
> >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
> >         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
> >         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
> >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
> >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
> >         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
> >         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
> >         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
> >         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
> > 2010-07-07 14:30:27,150 INFO org.apache.hadoop.ipc.Server: Stopping server on 9000
> > 2010-07-07 14:30:27,151 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
> >         at java.io.DataInputStream.readShort(DataInputStream.java:298)
> >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
> >         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
> >         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
> >         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
> >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
> >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
> >         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
> >         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
> >         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
> >         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
> >
>
