zookeeper-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.
Date Fri, 19 May 2017 21:48:04 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018073#comment-16018073 ]

ASF GitHub Bot commented on ZOOKEEPER-2778:
-------------------------------------------

Github user afine commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/247#discussion_r117582004
  
    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java ---
    @@ -682,27 +682,19 @@ public void setQuorumAddress(InetSocketAddress addr){
         }
     
         public InetSocketAddress getElectionAddress(){
    -        synchronized (QV_LOCK) {
    -            return myElectionAddr;
    -        }
    +        return myElectionAddr;
    --- End diff --
    
    > All set code paths are already protected by QV_LOCK, which implies that whoever calls
    > set* should already hold QV_LOCK.
    
    I'm not sure about this one. `setElectionAddress` is called by `recreateSocketAddresses`,
    which is called by `QuorumCnxManager#Listener.run` without acquiring QV_LOCK. I'm not sure
    what the implication of this is, although I believe you are correct about `setClientAddress`.
    
    > if we get an outdated addr (in case the current quorum peer is being reconfigured)
    > and send it to another peer, the other peer will not be able to connect, but that's fine:
    > it will retry until, at some later point, it gets the correct information.
    
    What is the behavior if we are able to connect to the "incorrect peer"? Will we eventually
    disconnect, or do we stay connected until reconfiguration occurs again?
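
    The trade-off in this thread (a getter that no longer takes QV_LOCK and may therefore return a slightly stale address) can be sketched as below. This is only an illustration, not the QuorumPeer code or the change in the pull request: the class `PeerAddresses`, the use of `AtomicReference`, and the host name `peer1.example` are all assumptions made for the example.

```java
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the address-holding part of a quorum peer.
class PeerAddresses {
    // Holds the current election address. Reads never block on a monitor.
    private final AtomicReference<InetSocketAddress> electionAddr =
            new AtomicReference<>();

    // Writers publish a new address atomically, with or without an outer lock.
    void setElectionAddress(InetSocketAddress addr) {
        electionAddr.set(addr);
    }

    // Readers may see the previous address while a reconfiguration is in
    // progress, but they cannot be blocked by a thread holding another lock.
    InetSocketAddress getElectionAddress() {
        return electionAddr.get();
    }

    public static void main(String[] args) {
        PeerAddresses peer = new PeerAddresses();
        // createUnresolved avoids a DNS lookup for the made-up host name.
        peer.setElectionAddress(
                InetSocketAddress.createUnresolved("peer1.example", 3888));
        System.out.println(peer.getElectionAddress().getPort());
    }
}
```

    With this shape, a caller on the listener thread can read the address while another thread holds a different lock. The cost is the staleness window raised in the question above.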


> Potential server deadlock between follower sync with leader and follower receiving external
connection requests.
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2778
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.5.3
>            Reporter: Michael Han
>            Assignee: Michael Han
>            Priority: Critical
>
> It's possible to have a deadlock during the recovery phase.
> Found this issue by analyzing thread dumps of the "flaky" ReconfigRecoveryTest [1]. Here
> is a sample thread dump that illustrates the state of the execution:
> {noformat}
>     [junit]  java.lang.Thread.State: BLOCKED
>     [junit]         at  org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
>     [junit]         at  org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
>     [junit]         at  org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
>     [junit]         at  org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
>     [junit]         at  org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
>     [junit] 
>     [junit]  java.lang.Thread.State: BLOCKED
>     [junit]         at  org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
>     [junit]         at  org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
>     [junit]         at  org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
>     [junit]         at  org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
>     [junit]         at  org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
>     [junit]         at  org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread, which runs the follower that is
> syncing with the leader, and the listener thread of the qcm (QuorumCnxManager) of the same
> quorum peer, which is receiving a connection. To finish syncing with the leader, the follower
> needs to synchronize on both QV_LOCK and the qcm object it owns; the receiver thread, to
> finish setting up an incoming connection, needs to synchronize on both the qcm object the
> quorum peer owns and the same QV_LOCK. The problem is that the two locks are acquired in
> different orders, so, depending on timing and actual execution order, each thread might end
> up holding one lock while waiting for the other.
> [1] org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig
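
The lock-ordering problem described above can be illustrated with a small sketch. It is not ZooKeeper code: `LockOrderingSketch`, `qvLock`, `cnxManager`, and the method names are hypothetical stand-ins for QV_LOCK and the qcm monitor. It shows the general remedy of acquiring both locks in one consistent order, which differs from the change in the diff above (dropping the lock from the getter).

```java
import java.util.concurrent.atomic.AtomicInteger;

class LockOrderingSketch {
    private final Object qvLock = new Object();     // stands in for QV_LOCK
    private final Object cnxManager = new Object(); // stands in for the qcm monitor
    private final AtomicInteger completed = new AtomicInteger();

    // Follower-sync path: QV_LOCK first, then the connection manager.
    void syncPath() {
        synchronized (qvLock) {
            synchronized (cnxManager) {
                completed.incrementAndGet();
            }
        }
    }

    // Listener path, written to use the same order as syncPath().
    // The deadlock-prone variant would take cnxManager first and then
    // qvLock, so each thread could hold one lock and wait for the other.
    void listenerPath() {
        synchronized (qvLock) {
            synchronized (cnxManager) {
                completed.incrementAndGet();
            }
        }
    }

    // Runs both paths concurrently and returns how many critical
    // sections completed.
    static int runBoth(int iterations) {
        LockOrderingSketch s = new LockOrderingSketch();
        Thread a = new Thread(() -> {
            for (int i = 0; i < iterations; i++) s.syncPath();
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < iterations; i++) s.listenerPath();
        });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return s.completed.get();
    }

    public static void main(String[] args) {
        System.out.println(runBoth(10_000));
    }
}
```

Because both paths take the locks in the same order, neither thread can hold one lock while waiting for the other, so both loops always finish.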



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
