hadoop-hdfs-dev mailing list archives

From "zhihai xu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-7801) "IOException:NameNode still not started" cause DFSClient operation failure without retry.
Date Mon, 16 Feb 2015 05:24:11 GMT
zhihai xu created HDFS-7801:
-------------------------------

             Summary: "IOException:NameNode still not started" cause DFSClient operation failure without retry.
                 Key: HDFS-7801
                 URL: https://issues.apache.org/jira/browse/HDFS-7801
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: hdfs-client, namenode
            Reporter: zhihai xu


"IOException:NameNode still not started" causes a DFSClient operation to fail without retry.
In YARN-1778, TestFSRMStateStore failed randomly because of "java.io.IOException: NameNode still not started".
The stack trace for this exception is the following:
{code}
2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2128)

	at org.apache.hadoop.ipc.Client.call(Client.java:1474)
	at org.apache.hadoop.ipc.Client.call(Client.java:1405)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy23.mkdirs(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:557)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
	at com.sun.proxy.$Proxy24.mkdirs(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2991)
	at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2961)
	at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:973)
	at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:969)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:969)
	at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:962)
	at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1869)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.storeApplicationStateInternal(FileSystemRMStateStore.java:364)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore$2.run(TestFSRMStateStore.java:273)
2015-02-03 00:09:19,089 INFO  [IPC Server handler 0 on 57792] ipc.Server (Server.java:run(2155)) - IPC Server handler 0 on 57792, call org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 127.0.0.1:57805 Call#14 Retry#1
java.io.IOException: NameNode still not started
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2128)
{code}
The error is intermittent because of the startup order in the NameNode constructor.
The constructor [sets the started flag at the end|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java#L826],
but it starts [NameNodeRpcServer|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java#L685]
earlier, by calling the function initialize, before the started flag is set.
If the client (which tries to call mkdirs) connects to the NameNode server before the started flag is set,
the server throws java.io.IOException: "NameNode still not started" and the test fails.
If the client connects to the NameNode server after the started flag is set, the test succeeds.
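
The window described above can be illustrated with a small stand-alone model. This is only a sketch, not Hadoop code: ModelNameNode, startRpcServer, markStarted and tryMkdirs are hypothetical names, and the "Connection refused" message is an assumption used to mark the state before the RPC server is up.

```java
import java.io.IOException;

/** Simplified, hypothetical model of the startup ordering; not Hadoop code. */
public class StartupOrderSketch {

    /** Stand-in for the NameNode: the RPC endpoint is reachable before 'started' is set. */
    static class ModelNameNode {
        private volatile boolean rpcServerRunning = false;
        private volatile boolean started = false;

        /** Models the RPC server being started from initialize(). */
        void startRpcServer() { rpcServerRunning = true; }

        /** Models the started flag being set at the end of the constructor. */
        void markStarted() { started = true; }

        /** Plays the role of the checkNNStartup() guard: reject calls until 'started' is set. */
        void mkdirs(String path) throws IOException {
            if (!rpcServerRunning) {
                throw new IOException("Connection refused"); // assumed message, model only
            }
            if (!started) {
                throw new IOException("NameNode still not started");
            }
        }
    }

    /** Returns the error message the client sees, or "ok" on success. */
    static String tryMkdirs(ModelNameNode nn) {
        try {
            nn.mkdirs("/tmp/x");
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        ModelNameNode nn = new ModelNameNode();
        nn.startRpcServer();
        System.out.println(tryMkdirs(nn)); // call lands inside the window: fails
        nn.markStarted();
        System.out.println(tryMkdirs(nn)); // call lands after the window: succeeds
    }
}
```

In the real test the two client calls are not sequenced like this; whether the mkdirs call lands inside or after the window depends on thread timing, which is why the failure is random.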
As discussed in YARN-1778, there are two ways to fix this issue in HDFS:
1. Reorder the code in the NameNode constructor: move rpcServer.start to the end, after the started flag is set.
2. Retry in DFSClient on IOException: NameNode still not started. We can create a new
RetryPolicy that retries on this exception.
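
The second option could look roughly like the following generic sketch. It does not use the Hadoop RetryPolicy interface or DFSClient. callWithRetry, maxAttempts and sleepMillis are hypothetical names, and matching on the message text is an assumption made only to keep the example self-contained.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical client-side retry helper; not the DFSClient or Hadoop RetryPolicy API. */
public class NotStartedRetrySketch {
    static final String NOT_STARTED = "NameNode still not started";

    /**
     * Runs op, retrying only when it fails with an IOException whose message
     * contains the startup error. Any other failure is rethrown immediately.
     */
    static <T> T callWithRetry(Callable<T> op, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                boolean retriable =
                        e.getMessage() != null && e.getMessage().contains(NOT_STARTED);
                if (!retriable || attempt >= maxAttempts) {
                    throw e; // give up: not the startup error, or out of attempts
                }
                Thread.sleep(sleepMillis); // fixed delay between attempts
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated operation: fails twice with the startup error, then succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException(NOT_STARTED);
            }
            return "created";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "created after 3 attempts"
    }
}
```

Matching on the exception message is brittle. A real fix would more likely use a dedicated exception type or a policy keyed on the remote exception class, which is part of what this issue asks to discuss.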

We need to discuss which is the correct way to fix this issue, or whether
we need to fix it at all: a fix is unnecessary if we can guarantee that the DFSClient always starts after the NameNode
in real-world deployments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
