hadoop-yarn-issues mailing list archives

From "zhihai xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3023) Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash
Date Fri, 09 Jan 2015 05:39:34 GMT

    [ https://issues.apache.org/jira/browse/YARN-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270581#comment-14270581
] 

zhihai xu commented on YARN-3023:
---------------------------------

Yes, you are right. The issue is the same as YARN-2721.

> Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash 
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-3023
>                 URL: https://issues.apache.org/jira/browse/YARN-3023
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>
> A race condition in ZKRMStateStore#createWithRetries against ZooKeeper causes the RM to crash.
> The sequence for the race condition is as follows:
> 1. The RM stores the attempt state to ZK by calling createWithRetries.
> {code}
> 2015-01-06 12:37:35,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
Storing attempt: AppId: application_1418914202950_42363 AttemptId: appattempt_1418914202950_42363_000001
MasterContainer: Container: [ContainerId: container_1418914202950_42363_01_000001,
> {code}
> 2. Unluckily, a ConnectionLoss on the ZK session happened at the same time as the RM stored
the attempt state to ZK.
> The ZooKeeper server created the node and stored the data successfully, but because of the
ConnectionLoss, the RM did not know that the operation (createWithRetries) had succeeded.
> {code}
> 2015-01-06 12:37:36,102 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> {code}
> 3. The RM retried storing the attempt state to ZK after one second.
> {code}
> 2015-01-06 12:37:36,104 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
Retrying operation on ZK. Retry no. 1
> {code}
> 4. During the one-second interval, the ZK session was reconnected.
> {code}
> 2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
initiating session
> 2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete
on server, sessionid = 0x44a9166eb2d12cb, negotiated timeout = 10000
> {code}
> 5. Because the node was already created in ZooKeeper by the first try (runWithCheck),
> the second try fails with a NodeExists KeeperException.
> {code}
> 2015-01-06 12:37:37,116 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
> 2015-01-06 12:37:37,118 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
Maxed out ZK retries. Giving up!
> {code}
> 6. This NodeExists KeeperException causes storing the AppAttempt to fail in RMStateStore.
> {code}
> 2015-01-06 12:37:37,118 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
Error storing appAttempt: appattempt_1418914202950_42363_000001
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
> {code}
> 7. RMStateStore sends an RMFatalEventType.STATE_STORE_OP_FAILED event to the ResourceManager.
> {code}
>   protected void notifyStoreOperationFailed(Exception failureCause) {
>     RMFatalEventType type;
>     if (failureCause instanceof StoreFencedException) {
>       type = RMFatalEventType.STATE_STORE_FENCED;
>     } else {
>       type = RMFatalEventType.STATE_STORE_OP_FAILED;
>     }
>     rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause));
>   }
> {code}
> 8. The ResourceManager kills itself after receiving the STATE_STORE_OP_FAILED RMFatalEvent.
> {code}
> 2015-01-06 12:37:37,128 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED.
Cause:
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
> 2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
> {code}
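
The sequence above can be reproduced, and one possible mitigation illustrated, with a small self-contained sketch. It does not use the real ZooKeeper client or ZKRMStateStore code: FakeZk and the two exception classes are hypothetical stand-ins, and this createWithRetries is an illustration only, not necessarily the change made for YARN-2721. The idea is that a NodeExists error seen on a retry that follows a ConnectionLoss can be treated as success when the node already holds the data we tried to write.

```java
import java.util.HashMap;
import java.util.Map;

class IdempotentCreateSketch {

    /** Hypothetical stand-in for KeeperException.ConnectionLossException. */
    static class ConnectionLossException extends Exception {}

    /** Hypothetical stand-in for KeeperException.NodeExistsException. */
    static class NodeExistsException extends Exception {}

    /** In-memory stand-in for a ZooKeeper server (assumption, not the real API). */
    static class FakeZk {
        final Map<String, String> nodes = new HashMap<>();
        boolean dropNextReply; // simulate losing the reply after a successful write

        void create(String path, String data)
                throws ConnectionLossException, NodeExistsException {
            if (nodes.containsKey(path)) {
                throw new NodeExistsException();
            }
            nodes.put(path, data); // the server-side create succeeds
            if (dropNextReply) {
                dropNextReply = false;
                throw new ConnectionLossException(); // but the client never hears about it
            }
        }
    }

    /**
     * Retries create on ConnectionLoss and returns the number of attempts made.
     * A NodeExists error after a ConnectionLoss is accepted as success if the
     * stored data matches what we tried to write. The real code waits between
     * retries; the wait is omitted here.
     */
    static int createWithRetries(FakeZk zk, String path, String data, int maxRetries)
            throws ConnectionLossException, NodeExistsException {
        boolean sawConnectionLoss = false;
        for (int attempt = 0; ; attempt++) {
            try {
                zk.create(path, data);
                return attempt + 1;
            } catch (ConnectionLossException e) {
                if (attempt >= maxRetries) {
                    throw e;
                }
                sawConnectionLoss = true;
            } catch (NodeExistsException e) {
                if (sawConnectionLoss && data.equals(zk.nodes.get(path))) {
                    return attempt + 1; // our earlier attempt created the node
                }
                throw e; // a genuine conflict: the node was not created by us
            }
        }
    }

    public static void main(String[] args) throws Exception {
        FakeZk zk = new FakeZk();
        zk.dropNextReply = true; // steps 2 to 5 of the sequence above
        int attempts = createWithRetries(zk, "/attempt", "state", 3);
        System.out.println("stored after " + attempts + " attempt(s)");
    }
}
```

Without the NodeExists branch, the second attempt would propagate the exception, which corresponds to steps 6 to 8 above. Comparing the stored data is a conservative choice in this sketch; a caller that owns the path exclusively might accept NodeExists without the comparison.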



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)