hadoop-yarn-issues mailing list archives

From "Rohith (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2946) Deadlock in ZKRMStateStore
Date Thu, 11 Dec 2014 15:37:13 GMT

    [ https://issues.apache.org/jira/browse/YARN-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242695#comment-14242695 ]

Rohith commented on YARN-2946:
------------------------------

I wrote a small program (TestYARN2946.java, attached) to simulate the exact deadlock scenario.
For easier understanding, it uses the same names as the classes involved in the deadlock and
follows the same implementation logic. Running TestYARN2946.java with the synchronized keyword
on the method updateFencedState() causes a deadlock. After the fix, i.e. with the synchronized
keyword removed, the program runs in its while loop without deadlocking. This is only a simulation.

In the attached patch, I have made 2 changes:
# Removed the *synchronized* keyword from the method updateFencedState().
# Changed the modifier of updateFencedState() from public to private, since it is called
only from the method notifyStoreOperationFailed().
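
For readers without the attachment, the lock-ordering problem can be illustrated with a minimal sketch. This is not TestYARN2946.java and not the actual ZKRMStateStore code. The class name FencedStateLockSketch, the method runsToCompletion(), and the stateMachineLock are hypothetical, and a ReentrantLock with a timed tryLock() stands in for the ZKRMStateStore.this monitor so that the demo reports a would-be deadlock instead of hanging. It assumes one thread holds a state machine lock and then needs the store lock, while the other thread does the reverse when the update is synchronized.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

public class FencedStateLockSketch {

    /**
     * Returns true if both threads obtained every lock they needed,
     * false if either one timed out (i.e. a real monitor would deadlock).
     */
    public static boolean runsToCompletion(boolean synchronizedUpdate)
            throws InterruptedException {
        // Stands in for the ZKRMStateStore.this monitor (assumption).
        ReentrantLock storeLock = new ReentrantLock();
        // Hypothetical lock held while a state machine transition runs.
        ReentrantLock stateMachineLock = new ReentrantLock();
        // Forces both threads to hold their first lock before asking for the second.
        CountDownLatch bothReady = new CountDownLatch(2);
        AtomicBoolean ok = new AtomicBoolean(true);

        // Thread 1: inside a transition, then calls into the store.
        Thread dispatcher = new Thread(() -> {
            stateMachineLock.lock();
            try {
                bothReady.countDown();
                bothReady.await();
                if (storeLock.tryLock(500, TimeUnit.MILLISECONDS)) {
                    storeLock.unlock();
                } else {
                    ok.set(false); // would block forever on a plain monitor
                }
            } catch (InterruptedException e) {
                ok.set(false);
            } finally {
                stateMachineLock.unlock();
            }
        });

        // Thread 2: the fenced-state update, which triggers a transition.
        Thread storeOp = new Thread(() -> {
            if (synchronizedUpdate) {
                storeLock.lock(); // models the synchronized keyword
            }
            try {
                bothReady.countDown();
                bothReady.await();
                if (stateMachineLock.tryLock(500, TimeUnit.MILLISECONDS)) {
                    stateMachineLock.unlock();
                } else {
                    ok.set(false);
                }
            } catch (InterruptedException e) {
                ok.set(false);
            } finally {
                if (synchronizedUpdate) {
                    storeLock.unlock();
                }
            }
        });

        dispatcher.start();
        storeOp.start();
        dispatcher.join();
        storeOp.join();
        return ok.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("with synchronized:    completes=" + runsToCompletion(true));
        System.out.println("without synchronized: completes=" + runsToCompletion(false));
    }
}
```

With the store lock taken (the synchronized case) the two threads acquire the locks in opposite order and at least one of them times out. Without it, the second thread needs only the state machine lock, so both are expected to finish.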

Kindly review the analysis and attached patch.

> Deadlock in ZKRMStateStore
> --------------------------
>
>                 Key: YARN-2946
>                 URL: https://issues.apache.org/jira/browse/YARN-2946
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Rohith
>            Assignee: Rohith
>            Priority: Blocker
>         Attachments: 0001-YARN-2946.patch, TestYARN2946.java
>
>
> Found a deadlock in ZKRMStateStore.
> # Initially, zkClient is null because of a ZooKeeper Disconnected event.
> # While ZKRMStateStore#runWithCheck() is in wait(zkSessionTimeout), waiting for zkClient to
> re-establish the ZooKeeper connection via either a SyncConnected or an Expired event, it is
> highly likely that another thread obtains the lock on {{ZKRMStateStore.this}} from state
> machine transition events. This causes a deadlock in ZKRMStateStore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
