hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Bart <rb...@cs.washington.edu>
Subject Re: NullPointerException on namenode startup
Date Tue, 15 May 2012 21:31:25 GMT
I think we've fixed this issue by temporarily modifying the Hadoop code to
squelch the NPE - more details below.

This issue seems to happen often.  I find it particularly interesting
that one of these threads was reported on the same day that our
secondary namenode crashed with this error.

http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201205.mbox/%3C4FA3DFFA.9080507@bnl.gov%3E

 After reading the internet our options seemed to be:

1. Use the checkpoint and revert to May 4th
2. Manually repair the "edits" or "edits.new" file
3. Temporarily edit the Hadoop code, probably in or around
org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits:626 to
catch the NullPointerException, possibly replacing it with some kind
of no-op, or treating it as an EOF

Since the secondary namenode crashed on May 4th, we didn't want to go
back to our last checkpoint since that meant losing 10 days of data.
I didn't know how to begin with #2 so I tried #3 as described by Terry
Healy.  I added a try-catch around line 696.  It originally was:

   INodeFile node = (INodeFile)fsDir.unprotectedAddFile(path,
permissions, blocks, replication, mtime, atime, blockSize);

And it became:

   INodeFile node = null;
   try {
       node = (INodeFile)fsDir.unprotectedAddFile(path, permissions,
blocks, replication, mtime, atime, blockSize);
   }
   catch (NullPointerException e) {
   }

I started up just the namenode and it came up just fine.  After that,
I reverted the code, recompiled, and booted up the system.  Worked
like a charm.

Peace.

On Mon, May 14, 2012 at 11:13 PM, Robert Bart <rbart@cs.washington.edu>wrote:

> Hi all,
>
> I have a cluster of 9 nodes running Hadoop 1.0.2. I am unable to bring our
> the cluster back up after an unsuccessful attempt to restart it. The
> problem appears to be caused by a NullPointerException being thrown during
> startup of the namenode. The exception seems to be thrown while reading one
> of the binary files "edits" or "edits.new" in dfs.name.dir (stack trace
> below)
>
> Googling turns up reports of similar problems, and the most common
> solution seems to require manually repairing one or both of these files.
>
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200912.mbox/%3Ca03ecfdb0912160654g4d2f6bd0i73bd690bda55b753@mail.gmail.com%3E
>
> http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201205.mbox/%3C4FA3DFFA.9080507@bnl.gov%3E
>
> http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201010.mbox/%3C83E09645-A671-4DCE-89A6-D0E1952190A9@mlogiciels.com%3E
>
> Does this indeed appear to be corruption of the files in dfs.name.dir, or
> is it possible that something else is going on? If the former, is manually
> repairing the files using a hex editor really the best way to proceed? Any
> ideas on how to fix this without losing data would be greatly appreciated!
>
> Here is the relevant part of the namenode log:
>
> ************************************************************/
> 2012-05-14 21:29:05,459 INFO hdfs.server.namenode.NameNode: STARTUP_MSG:
> /************************************************************
> STARTUP_MSG: Starting NameNode
> STARTUP_MSG:   host = rv-n11.cs.washington.edu/128.95.1.25
> STARTUP_MSG:   args = []
> STARTUP_MSG:   vers ion = 1.0.3-SNAPSHOT
> STARTUP_MSG:   build =  -r ; compiled by 'knowall' on Tue Apr 17 10:43:14
> PDT 2012
> ************************************************************/
> 2012-05-14 21:29:05,608 INFO metrics2.impl.MetricsConfig: loaded
> properties from hadoop-metrics2.properties
> 2012-05-14 21:29:05,625 INFO metrics2.impl.MetricsSourceAdapter: MBean for
> source MetricsSystem,sub=Stats registered.
> 2012-05-14 21:29:05,626 INFO metrics2.impl.MetricsSystemImpl: Scheduled
> snapshot period at 10 second(s).
> 2012-05-14 21:29:05,626 INFO metrics2.impl.MetricsSystemImpl: NameNode
> metrics system started
> 2012-05-14 21:29:05,754 INFO metrics2.impl.MetricsSourceAdapter: MBean for
> source ugi registered.
> 2012-05-14 21:29:05,766 INFO metrics2.impl.MetricsSourceAdapter: MBean for
> source jvm registered.
> 2012-05-14 21:29:05,768 INFO metrics2.impl.MetricsSourceAdapter: MBean for
> source NameNode registered.
> 2012-05-14 21:29:05,802 INFO hdfs.util.GSet: VM type       = 64-bit
> 2012-05-14 21:29:05,802 INFO hdfs.util.GSet: 2% max memory = 17.77875 MB
> 2012-05-14 21:29:05,803 INFO hdfs.util.GSet: capacity      = 2^21 =
> 2097152 entries
> 2012-05-14 21:29:05,803 INFO hdfs.util.GSet: recommended=2097152,
> actual=2097152
> 2012-05-14 21:29:05,825 INFO hdfs.server.namenode.FSNamesystem:
> fsOwner=knowall
> 2012-05-14 21:29:05,825 INFO hdfs.server.namenode.FSNamesystem:
> supergroup=supergroup
> 2012-05-14 21:29:05,825 INFO hdfs.server.namenode.FSNamesystem:
> isPermissionEnabled=true
> 2012-05-14 21:29:05,833 INFO hdfs.server.namenode.FSNamesystem:
> dfs.block.invalidate.limit=100
> 2012-05-14 21:29:05,833 INFO hdfs.server.namenode.FSNamesystem:
> isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s),
> accessTokenLifetime=0 min(s)
> 2012-05-14 21:29:06,005 INFO hdfs.server.namenode.FSNamesystem: Registered
> FSNamesystemStateMBean and NameNodeMXBean
> 2012-05-14 21:29:06,032 INFO hdfs.server.namenode.NameNode: Caching file
> names occuring more than 10 times
> 2012-05-14 21:29:06,044 INFO hdfs.server.common.Storage: Number of files =
> 52617
> 2012-05-14 21:29:07,319 INFO hdfs.server.common.Storage: Number of files
> under construction = 34
> 2012-05-14 21:29:07,328 INFO hdfs.server.common.Storage: Image file of
> size 15201146 loaded in 1 seconds.
> 2012-05-14 21:29:07,334 ERROR hdfs.server.namenode.NameNode:
> java.lang.NullPointerException
>         at hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1094)
>         at hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1106)
>         at hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1009)
>         at
> hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
>         at hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:626)
>         at hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1015)
>         at hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:833)
>         at
> hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:372)
>         at
> hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>         at
> hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>         at hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>         at hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>         at hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>         at hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>         at hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>
> 2012-05-14 21:29:07,334 INFO hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NameNode at
> rv-n11.cs.washington.edu/128.95.1.25
> ************************************************************/
>
>
> --
> Rob
>
>


-- 
Rob

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message