hadoop-yarn-issues mailing list archives

From "zhihai xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion(Op.delete).
Date Sun, 22 Mar 2015 18:38:10 GMT

    [ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375098#comment-14375098 ]

zhihai xu commented on YARN-3385:
---------------------------------

The sequence of events in the race condition is as follows:
1. The RM tries to remove the state of application application_1426560404988_0132 from ZKRMStateStore.
{code}
2015-03-17 19:18:48,075 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Max
number of completed apps kept in state store met: maxCompletedAppsInStateStore = 10000, removing
app application_1426560404988_0132 from state store.
2015-03-17 19:18:48,075 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
Removing info for app: application_1426560404988_0132
{code}

2. Unluckily, a ConnectionLoss for the ZK session happened at the same time as the RM was removing the application state from ZK.
The ZooKeeper server deleted the node successfully, but because of the ConnectionLoss the RM did not know that the operation had succeeded.
{code}
2015-03-17 19:18:51,836 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
{code}

3. The RM retried removing the application state from ZK.
{code}
2015-03-17 19:18:51,837 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
Retrying operation on ZK. Retry no. 1
{code}

4. During the retry, the ZK session was reconnected.
{code}
2015-03-17 19:18:58,924 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete
on server, sessionid = 0x24be28f536e2006, negotiated timeout = 10000
{code}

5. Because the node had already been deleted at the ZooKeeper server by the previous operation, the retry failed with a NoNode KeeperException.
{code}
2015-03-17 19:18:58,956 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
2015-03-17 19:18:58,956 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
Maxed out ZK retries. Giving up!
{code}

6. This NoNode KeeperException caused the app removal to fail in RMStateStore.
{code}
2015-03-17 19:18:58,956 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
Error removing app: application_1426560404988_0132
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
{code}

7. RMStateStore then sends an RMFatalEventType.STATE_STORE_OP_FAILED event to the ResourceManager.
{code}
  protected void notifyStoreOperationFailed(Exception failureCause) {
    RMFatalEventType type;
    if (failureCause instanceof StoreFencedException) {
      type = RMFatalEventType.STATE_STORE_FENCED;
    } else {
      type = RMFatalEventType.STATE_STORE_OP_FAILED;
    }
    rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause));
  }
{code}

8. The ResourceManager killed itself after receiving the STATE_STORE_OP_FAILED RMFatalEvent.
{code}
2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED.
Cause:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}
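
The sequence above can be reproduced in miniature without a real ZooKeeper ensemble. The following is only a sketch under stated assumptions, not the ZKRMStateStore code and not necessarily the fix that will be applied for this issue: FlakyStore, the two exception classes, deleteWithRetries and the "/example/..." path are all hypothetical stand-ins invented for illustration. It shows one possible way to make the delete retry idempotent, namely treating NoNode as success because the desired end state (node absent) already holds.

```java
import java.util.HashSet;
import java.util.Set;

final class IdempotentDeleteSketch {
    // Hypothetical stand-ins for KeeperException.ConnectionLossException and
    // KeeperException.NoNodeException. The real ones are checked exceptions;
    // these are unchecked only to keep the sketch short.
    static class ConnectionLossException extends RuntimeException {}
    static class NoNodeException extends RuntimeException {}

    /**
     * Hypothetical in-memory store that mimics steps 2 and 5: the first
     * delete is applied on the "server", but its reply is lost.
     */
    static class FlakyStore {
        private final Set<String> nodes = new HashSet<>();
        private boolean dropNextReply;

        FlakyStore(boolean dropNextReply) {
            this.dropNextReply = dropNextReply;
        }

        void create(String path) {
            nodes.add(path);
        }

        boolean exists(String path) {
            return nodes.contains(path);
        }

        void delete(String path) {
            if (!nodes.remove(path)) {
                throw new NoNodeException();         // step 5: node is already gone
            }
            if (dropNextReply) {
                dropNextReply = false;
                throw new ConnectionLossException(); // step 2: deleted, but reply lost
            }
        }
    }

    /**
     * Retries on connection loss and treats NoNode as success.
     * Returns the number of attempts that were needed.
     */
    static int deleteWithRetries(FlakyStore store, String path, int maxRetries) {
        int attempt = 0;
        while (true) {
            attempt++;
            try {
                store.delete(path);
                return attempt;
            } catch (NoNodeException e) {
                // The node is absent, which is what the caller wanted.
                return attempt;
            } catch (ConnectionLossException e) {
                if (attempt > maxRetries) {
                    throw e; // retries exhausted
                }
            }
        }
    }

    public static void main(String[] args) {
        FlakyStore store = new FlakyStore(true);
        String path = "/example/application_1426560404988_0132"; // hypothetical path
        store.create(path);
        int attempts = deleteWithRetries(store, path, 1);
        System.out.println("attempts=" + attempts + ", exists=" + store.exists(path));
    }
}
```

With one retry allowed, the first attempt loses its reply and the second attempt hits NoNode, which the helper accepts instead of propagating. Swallowing NoNode is only reasonable for operations whose goal is that the node be absent; a real store would also have to consider whether another writer could have removed the node.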


> Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion(Op.delete).
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3385
>                 URL: https://issues.apache.org/jira/browse/YARN-3385
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>
> Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion(Op.delete).
> The race condition is similar to YARN-2721 and YARN-3023.
> Because the race condition exists for ZK node creation, it should also exist for ZK node deletion.
> We see this issue with the following stack trace:
> {code}
> 2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> 	at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
> 	at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> 	at java.lang.Thread.run(Thread.java:745)
> 2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
