Date: Mon, 16 Feb 2015 05:24:11 +0000 (UTC)
From: "zhihai xu (JIRA)"
To: hdfs-dev@hadoop.apache.org
Subject: [jira] [Created] (HDFS-7801) "IOException: NameNode still not started" causes DFSClient operation failure without retry.

zhihai xu created HDFS-7801:
-------------------------------
             Summary: "IOException: NameNode still not started" causes DFSClient operation failure without retry.
                 Key: HDFS-7801
                 URL: https://issues.apache.org/jira/browse/HDFS-7801
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: hdfs-client, namenode
            Reporter: zhihai xu

"IOException: NameNode still not started" causes DFSClient operations to fail without retry. In YARN-1778, TestFSRMStateStore failed randomly because of "java.io.IOException: NameNode still not started".
The stack trace for this exception is the following:
{code}
2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2128)
        at org.apache.hadoop.ipc.Client.call(Client.java:1474)
        at org.apache.hadoop.ipc.Client.call(Client.java:1405)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy23.mkdirs(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:557)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
        at com.sun.proxy.$Proxy24.mkdirs(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2991)
        at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2961)
        at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:973)
        at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:969)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:969)
        at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:962)
        at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1869)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.storeApplicationStateInternal(FileSystemRMStateStore.java:364)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore$2.run(TestFSRMStateStore.java:273)
2015-02-03 00:09:19,089 INFO [IPC Server handler 0 on 57792] ipc.Server (Server.java:run(2155)) - IPC Server handler 0 on 57792, call org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 127.0.0.1:57805 Call#14 Retry#1
java.io.IOException: NameNode still not started
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2128)
{code}

The reason for this random failure is that the NameNode constructor [sets the started flag at the end|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java#L826], but it starts the [NameNodeRpcServer|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java#L685] from initialize before the started flag is set.
If the client (which tries to call mkdirs) connects to the NameNode server before the started flag is set, "java.io.IOException: NameNode still not started" is thrown and the test fails. If the client connects to the NameNode server after the started flag is set, the test succeeds.

As discussed in YARN-1778, there are two ways to fix this issue in HDFS:
1. Reorder the code in the NameNode constructor: move rpcServer.start to the end, after the started flag is set.
2. Retry in DFSClient on "IOException: NameNode still not started". We can create a new RetryPolicy to retry on this exception; a rough sketch follows below.

We need to discuss which is the correct way to fix this issue, or whether we don't need to fix it at all if we can guarantee that the DFSClient always starts after the NameNode in the real world.
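To make option 2 concrete, here is a minimal sketch of what such a retry policy could look like, assuming the RetryPolicy/RetryAction shape of org.apache.hadoop.io.retry in current trunk and RemoteException unwrapping; the class name, message matching, and retry parameters are illustrative only, not an actual patch:

{code}
import java.io.IOException;

import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.ipc.RemoteException;

/**
 * Illustrative retry policy: retry when the NameNode rejects a call because
 * it has not finished starting up, and delegate everything else to a
 * fallback policy. Names and parameters are assumptions, not part of a patch.
 */
public class RetryOnNameNodeNotStarted implements RetryPolicy {
  private static final String NOT_STARTED_MSG = "NameNode still not started";

  private final RetryPolicy fallback;  // policy used for all other exceptions
  private final int maxRetries;        // retries allowed for the startup race
  private final long delayMillis;      // fixed sleep between retries

  public RetryOnNameNodeNotStarted(RetryPolicy fallback, int maxRetries,
      long delayMillis) {
    this.fallback = fallback;
    this.maxRetries = maxRetries;
    this.delayMillis = delayMillis;
  }

  @Override
  public RetryAction shouldRetry(Exception e, int retries, int failovers,
      boolean isIdempotentOrAtMostOnce) throws Exception {
    // The server-side IOException reaches the client wrapped in a
    // RemoteException, so unwrap it before inspecting the message.
    Exception cause = e;
    if (e instanceof RemoteException) {
      cause = ((RemoteException) e).unwrapRemoteException();
    }
    if (cause instanceof IOException && cause.getMessage() != null
        && cause.getMessage().contains(NOT_STARTED_MSG)) {
      if (retries < maxRetries) {
        // The NameNode is up but has not set its started flag yet;
        // sleep briefly and try the same call again.
        return new RetryAction(RetryAction.RetryDecision.RETRY, delayMillis);
      }
      return RetryAction.FAIL;
    }
    // All other exceptions keep the behavior of the existing policy.
    return fallback.shouldRetry(e, retries, failovers, isIdempotentOrAtMostOnce);
  }
}
{code}

DFSClient could then wrap whatever default policy it already uses for the NameNode proxy with something like new RetryOnNameNodeNotStarted(defaultPolicy, 10, 1000); the retry count and sleep time here are placeholders that would need tuning.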