hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Varun Saxena (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4127) RM fail with noAuth error if switched from failover mode to non-failover mode
Date Fri, 30 Oct 2015 07:15:27 GMT

    [ https://issues.apache.org/jira/browse/YARN-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982027#comment-14982027
] 

Varun Saxena commented on YARN-4127:
------------------------------------

[~jianhe]
bq. however for the branch-2.7 patch, if I run the test case without the core change, the
test will keep in a loop and not finish. could you take a look ?
This is because we do not handle NoAuth exception properly in branch-2.7 code when HA is not
enabled.
In ZKRMStateStore#runWithRetries, we have code as under. As can be seen if HA is not enabled,
we neither rethrow NoAuthException nor do we have any logic increment retries and back out
if retries are maxed out.
With fix in this patch, probably NoAuth will never come until and unless someone changes it
from CLI. I will go ahead and file another JIRA.
{code}
    T runWithRetries() throws Exception {
      int retry = 0;
      while (true) {
        try {
          return runWithCheck();
        } catch (KeeperException.NoAuthException nae) {
          if (HAUtil.isHAEnabled(getConfig())) {
            // NoAuthException possibly means that this store is fenced due to
            // another RM becoming active. Even if not,
            // it is safer to assume we have been fenced
            throw new StoreFencedException();
          }
        } catch (KeeperException ke) {
          .............
       }
     }
  }
{code}

> RM fail with noAuth error if switched from failover mode to non-failover mode 
> ------------------------------------------------------------------------------
>
>                 Key: YARN-4127
>                 URL: https://issues.apache.org/jira/browse/YARN-4127
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: Jian He
>            Assignee: Varun Saxena
>         Attachments: YARN-4127-branch-2.7.01.patch, YARN-4127.01.patch, YARN-4127.02.patch
>
>
> The scenario is that RM failover was initially enabled, so the zkRootNodeAcl is by default
set with the *RM ID* in the ACL string 
> If RM failover is then switched to be disabled,  it cannot load data from ZK and fail
with noAuth error. After I reset the root node ACL, it again can access.
> {code}
> 15/09/08 14:28:34 ERROR resourcemanager.ResourceManager: Failed to load/recover state
> org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
>   at org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:159)
>   at org.apache.curator.framework.imps.CuratorTransactionImpl.access$200(CuratorTransactionImpl.java:44)
>   at org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:129)
>   at org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:125)
>   at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
>   at org.apache.curator.framework.imps.CuratorTransactionImpl.commit(CuratorTransactionImpl.java:122)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$SafeTransaction.commit(ZKRMStateStore.java:1009)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.safeSetData(ZKRMStateStore.java:985)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.getAndIncrementEpoch(ZKRMStateStore.java:374)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:579)
>   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:973)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1014)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1010)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1010)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1050)
>   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1194)
> {code}
>  the problem may be that in non-failover mode, RM doesn't use the *RM-ID* to connect
with ZK and thus fail with no Auth error.
> We should be able to switch failover on and off with no interruption to the user. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message