hadoop-user mailing list archives

From Peter Sheridan <psheri...@millennialmedia.com>
Subject Advice on post mortem of data loss (v 1.0.3)
Date Fri, 01 Feb 2013 16:40:22 GMT
Yesterday, I bounced my DFS cluster.  We realized that "ulimit -u" was, in extreme cases,
preventing the name node from creating threads.  This had only started occurring within the
last day or so.  When I brought the name node back up, it had essentially been rolled back
by one week, and I lost all changes which had been made since then.
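For anyone who hits the same symptom, the per-user process limit (which also caps thread creation on Linux) can be checked from the shell the daemon runs under. A minimal sketch; the limits.conf entries are illustrative, assuming an "hdfs" service account rather than anything from this cluster:

```shell
# Check the max-user-processes limit for the account running the NameNode.
# When it is exhausted, the JVM fails with "unable to create new native thread".
ulimit -u    # soft limit
ulimit -Hu   # hard limit

# One common way to raise it persistently is an entry in
# /etc/security/limits.conf (user name and value are illustrative):
#   hdfs  soft  nproc  32768
#   hdfs  hard  nproc  32768
```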

There are a few other factors to consider.

  1.  I had 3 locations for dfs.name.dir: one local and two NFS.  (I originally thought
this was two local and one NFS when I set it up.)  On 1/24, the day which we effectively rolled
back to, the second NFS mount started showing as FAILED on dfshealth.jsp.  We had seen this
before without issue, so I didn't consider it critical.
  2.  When I brought the name node back up, having discovered the above, I changed
dfs.name.dir to two local drives and one NFS mount, excluding the one which had failed.
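For reference, the redundant name-metadata locations are given to the NameNode as a comma-separated list in hdfs-site.xml; it writes fsimage/edits to every listed directory. A sketch of the post-change layout, with illustrative paths rather than the real mount points:

```xml
<!-- hdfs-site.xml: redundant NameNode metadata directories.
     Paths below are placeholders, not the actual mounts. -->
<property>
  <name>dfs.name.dir</name>
  <value>/disk1/dfs/name,/disk2/dfs/name,/nfs/dfs/name</value>
</property>
```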

Reviewing the name node log from the day with the NFS outage, I see:

2013-01-24 16:33:11,794 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable
to sync edit log.
java.io.IOException: Input/output error
        at sun.nio.ch.FileChannelImpl.force0(Native Method)
        at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:348)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog$EditLogFileOutputStream.flushAndSync(FSEditLog.java:215)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:89)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:1015)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1666)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:718)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
2013-01-24 16:33:11,794 WARN org.apache.hadoop.hdfs.server.common.Storage: Removing storage
dir /rdisks/xxxxxxxxxxxxxx

Unfortunately, since I wasn't expecting anything terrible to happen, I didn't look too closely
at the file system while the name node was down.  When I brought it up, the time stamp on
the previous checkpoint directory in dfs.name.dir was from right around the above error message.
The current directory basically had an fsimage and an empty edits log with the current time
stamp.

So: what happened?  Should this failure have led to my data loss?  I would have thought the
local directory would be fine in this scenario.  Did I have any other options for data recovery?
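For what it's worth, the recovery avenues I've been weighing after the fact look roughly like this. A sketch only: paths are illustrative, both options assume the NameNode is stopped, and -importCheckpoint assumes a SecondaryNameNode was checkpointing into fs.checkpoint.dir:

```shell
# Illustrative paths only, not the real mount points from this cluster.
NAME_DIR=/disk1/dfs/name
BACKUP_DIR="${NAME_DIR}.bak"

echo "back up ${NAME_DIR} to ${BACKUP_DIR} before touching anything"
# cp -a "${NAME_DIR}" "${BACKUP_DIR}"

# Option 1: with an empty dfs.name.dir, import the SecondaryNameNode's
# last checkpoint from fs.checkpoint.dir:
# hadoop namenode -importCheckpoint

# Option 2: if one surviving dfs.name.dir replica holds a newer
# fsimage/edits pair than the one the NameNode loaded, copy its current/
# contents into the stale replicas before restarting.
```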
