Return-Path: X-Original-To: apmail-hadoop-yarn-issues-archive@minotaur.apache.org Delivered-To: apmail-hadoop-yarn-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 562CB18149 for ; Fri, 30 Oct 2015 07:15:28 +0000 (UTC) Received: (qmail 69486 invoked by uid 500); 30 Oct 2015 07:15:28 -0000 Delivered-To: apmail-hadoop-yarn-issues-archive@hadoop.apache.org Received: (qmail 69437 invoked by uid 500); 30 Oct 2015 07:15:27 -0000 Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: yarn-issues@hadoop.apache.org Delivered-To: mailing list yarn-issues@hadoop.apache.org Received: (qmail 69426 invoked by uid 99); 30 Oct 2015 07:15:27 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 30 Oct 2015 07:15:27 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id B7B682C1F5D for ; Fri, 30 Oct 2015 07:15:27 +0000 (UTC) Date: Fri, 30 Oct 2015 07:15:27 +0000 (UTC) From: "Varun Saxena (JIRA)" To: yarn-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (YARN-4127) RM fail with noAuth error if switched from failover mode to non-failover mode MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/YARN-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982027#comment-14982027 ] Varun Saxena commented on YARN-4127: ------------------------------------ [~jianhe] bq. however for the branch-2.7 patch, if I run the test case without the core change, the test will keep in a loop and not finish. could you take a look ? This is because we do not handle NoAuth exception properly in branch-2.7 code when HA is not enabled. In ZKRMStateStore#runWithRetries, we have code as under. As can be seen if HA is not enabled, we neither rethrow NoAuthException nor do we have any logic increment retries and back out if retries are maxed out. With fix in this patch, probably NoAuth will never come until and unless someone changes it from CLI. I will go ahead and file another JIRA. {code} T runWithRetries() throws Exception { int retry = 0; while (true) { try { return runWithCheck(); } catch (KeeperException.NoAuthException nae) { if (HAUtil.isHAEnabled(getConfig())) { // NoAuthException possibly means that this store is fenced due to // another RM becoming active. Even if not, // it is safer to assume we have been fenced throw new StoreFencedException(); } } catch (KeeperException ke) { ............. } } } {code} > RM fail with noAuth error if switched from failover mode to non-failover mode > ------------------------------------------------------------------------------ > > Key: YARN-4127 > URL: https://issues.apache.org/jira/browse/YARN-4127 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Affects Versions: 2.7.1 > Reporter: Jian He > Assignee: Varun Saxena > Attachments: YARN-4127-branch-2.7.01.patch, YARN-4127.01.patch, YARN-4127.02.patch > > > The scenario is that RM failover was initially enabled, so the zkRootNodeAcl is by default set with the *RM ID* in the ACL string > If RM failover is then switched to be disabled, it cannot load data from ZK and fail with noAuth error. After I reset the root node ACL, it again can access. > {code} > 15/09/08 14:28:34 ERROR resourcemanager.ResourceManager: Failed to load/recover state > org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth > at org.apache.zookeeper.KeeperException.create(KeeperException.java:113) > at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949) > at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915) > at org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:159) > at org.apache.curator.framework.imps.CuratorTransactionImpl.access$200(CuratorTransactionImpl.java:44) > at org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:129) > at org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:125) > at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) > at org.apache.curator.framework.imps.CuratorTransactionImpl.commit(CuratorTransactionImpl.java:122) > at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$SafeTransaction.commit(ZKRMStateStore.java:1009) > at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.safeSetData(ZKRMStateStore.java:985) > at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.getAndIncrementEpoch(ZKRMStateStore.java:374) > at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:579) > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:973) > at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1014) > at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1010) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667) > at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1010) > at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1050) > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1194) > {code} > the problem may be that in non-failover mode, RM doesn't use the *RM-ID* to connect with ZK and thus fail with no Auth error. > We should be able to switch failover on and off with no interruption to the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)