Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-issues@hadoop.apache.org
Date: Mon, 2 Dec 2013 23:52:36 +0000 (UTC)
From: "Jimmy Xiang (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Message-ID: <JIRA.12680801.1385148303508.58094.1386028356401@arcas>
In-Reply-To: <JIRA.12680801.1385148303508@arcas>
References: <JIRA.12680801.1385148303508@arcas>
Subject: [jira] [Commented] (HDFS-5555) CacheAdmin commands fail when first
 listed NameNode is in Standby
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HDFS-5555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837092#comment-13837092 ] 

Jimmy Xiang commented on HDFS-5555:
-----------------------------------

DFSClient uses RetryInvocationHandler in this case while the iterator uses ProtobufRpcEngine#Invoker which doesn't handle failover. One fix is to throw StandbyException at the beginning instead of returning an iterator. The other fix is to make sure the iterator supports failover as well.  If we throw  StandbyException at the beginning, the issue will comes up again when iterating the iterator while NN fails over in the middle.

> CacheAdmin commands fail when first listed NameNode is in Standby
> -----------------------------------------------------------------
>
>                 Key: HDFS-5555
>                 URL: https://issues.apache.org/jira/browse/HDFS-5555
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: caching
>    Affects Versions: 3.0.0
>            Reporter: Stephen Chu
>            Assignee: Jimmy Xiang
>
> I am on a HA-enabled cluster. The NameNodes are on host-1 and host-2.
> In the configurations, we specify the host-1 NN first and the host-2 NN afterwards in the _dfs.ha.namenodes.ns1_ property (where _ns1_ is the name of the nameservice).
> If the host-1 NN is Standby and the host-2 NN is Active, some CacheAdmins will fail complaining about operation not supported in standby state.
> e.g.
> {code}
> bash-4.1$ hdfs cacheadmin -removeDirectives -path /user/hdfs2
> Exception in thread "main" org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> 	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1501)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1082)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.listCacheDirectives(FSNamesystem.java:6892)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer$ServerSideCacheEntriesIterator.makeRequest(NameNodeRpcServer.java:1263)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer$ServerSideCacheEntriesIterator.makeRequest(NameNodeRpcServer.java:1249)
> 	at org.apache.hadoop.fs.BatchedRemoteIterator.makeRequest(BatchedRemoteIterator.java:77)
> 	at org.apache.hadoop.fs.BatchedRemoteIterator.makeRequestIfNeeded(BatchedRemoteIterator.java:85)
> 	at org.apache.hadoop.fs.BatchedRemoteIterator.hasNext(BatchedRemoteIterator.java:99)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.listCacheDirectives(ClientNamenodeProtocolServerSideTranslatorPB.java:1087)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1499)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1348)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1301)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> 	at com.sun.proxy.$Proxy9.listCacheDirectives(Unknown Source)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB$CacheEntriesIterator.makeRequest(ClientNamenodeProtocolTranslatorPB.java:1079)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB$CacheEntriesIterator.makeRequest(ClientNamenodeProtocolTranslatorPB.java:1064)
> 	at org.apache.hadoop.fs.BatchedRemoteIterator.makeRequest(BatchedRemoteIterator.java:77)
> 	at org.apache.hadoop.fs.BatchedRemoteIterator.makeRequestIfNeeded(BatchedRemoteIterator.java:85)
> 	at org.apache.hadoop.fs.BatchedRemoteIterator.hasNext(BatchedRemoteIterator.java:99)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$32.hasNext(DistributedFileSystem.java:1704)
> 	at org.apache.hadoop.hdfs.tools.CacheAdmin$RemoveCacheDirectiveInfosCommand.run(CacheAdmin.java:372)
> 	at org.apache.hadoop.hdfs.tools.CacheAdmin.run(CacheAdmin.java:84)
> 	at org.apache.hadoop.hdfs.tools.CacheAdmin.main(CacheAdmin.java:89)
> {code}
> After manually failing over from host-2 to host-1, the CacheAdmin commands succeed.
> The affected commands are:
> -listPools
> -listDirectives
> -removeDirectives


--
This message was sent by Atlassian JIRA
(v6.1#6144)