hadoop-hdfs-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anty <anty....@gmail.com>
Subject Fwd: Probably a bug in FsEditLog
Date Sun, 07 Apr 2013 13:44:37 GMT
---------- Forwarded message ----------
From: Anty <anty.rao@gmail.com>
Date: Sun, Apr 7, 2013 at 5:48 PM
Subject: Probably a bug in FsEditLog
To: hdfs-dev@hadoop.apache.org, Chance Li <chancelq@gmail.com>


Hi:ALL

In our cluster, we configure the NameNode to write to both local  and NFS
mounted directories. When the NFS mounted directory is inaccessible, the
NameNode should keep running without error, but our namenode crash with
following stack trace.

2013-04-02 23:35:21,535 WARN org.apache.hadoop.hdfs.server.common.Storage:
> Removing storage dir /nfs2-mount/onest3/dfs/name
> 2013-04-02 23:35:21,536 FATAL
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to find edits
> stream with IO error
> java.lang.Exception: Unable to find edits stream with IO error
>         at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.fatalExit(FSEditLog.java:430)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.removeEditsStreamsAndStorageDirs(FSEditLog.java:519)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:1139)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1641)
>         at
> org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:689)
>         at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
> 2013-04-02 23:35:21,539 INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>

According to the stack trace, When NameNode tries to sync edit log, it does
identify the mounted NFS directory is inaccessible, and attempt to remove
it from the FSEditLog#editStreams. However, it found the edit stream
corresponding to the mounted NFS has already been removed. Under this
circumstance, NameNode just kill itself, aborted!

 After looking through the source code of HDFS, I found there is another
code path of removing edit stream from FSEditLog#editStreams, which can
cause above race condition. In method FSEditLog#logEdit


>    if (getNumEditStreams() < 1)
>         {
>             throw new AssertionError("No edit streams to log to");
>         }
>         long start = FSNamesystem.now();
>         for (int idx = 0; idx < editStreams.size(); idx++)
>         {
>             EditLogOutputStream eStream = editStreams.get(idx);
>             try
>             {
>                 eStream.write(op, writables);
>             }
>             catch (IOException ioe)
>             {
>                 removeEditsAndStorageDir(idx);
>                 idx--;
>             }
>         }
>
>
The cause of this race condition lie in FSEditLog#logSync method, there are
two steps in FSEditLog#logSync

1.    Do sync operation, if one edit stream is accessible, put it into
error stream list.(un-synchronized)
2.    Delete error stream from FSEditLog#editStreams(synchronized)

Step #1 isn’t synchronized, so there is a possibility that after step#1 and
before step #2 the error stream has already been removed from other thread
by invoking FSEditLog#logEdit

If this was exactly a bug, my proposal fix is in method FSEditlog#logsync
ignore or print some warning message instead of aborting the namenode when
error edit stream doesn't exist in FSEditLog#editStreams



-- 
Best Regards
Anty Rao



-- 
Best Regards
Anty Rao

Mime
View raw message