zookeeper-dev mailing list archives

From "Michael Han (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.
Date Sat, 06 May 2017 05:09:04 GMT
Michael Han created ZOOKEEPER-2778:

             Summary: Potential server deadlock between follower sync with leader and follower
receiving external connection requests.
                 Key: ZOOKEEPER-2778
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
             Project: ZooKeeper
          Issue Type: Bug
          Components: quorum
    Affects Versions: 3.5.3
            Reporter: Michael Han
            Assignee: Michael Han
            Priority: Critical

It's possible to have a deadlock during the recovery phase.
I found this issue by analyzing thread dumps of the "flaky" ReconfigRecoveryTest. Here is a sample
thread dump that illustrates the state of the execution:

    [junit]  java.lang.Thread.State: BLOCKED
    [junit]         at  org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
    [junit]         at  org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
    [junit]         at  org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
    [junit]         at  org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
    [junit]         at  org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)

    [junit]  java.lang.Thread.State: BLOCKED
    [junit]         at  org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
    [junit]         at  org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
    [junit]         at  org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
    [junit]         at  org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
    [junit]         at  org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
    [junit]         at  org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)

The deadlock is between the quorum peer thread, which runs the follower that is syncing with
the leader, and the listener thread of the qcm (QuorumCnxManager) of the same quorum peer, which
is handling an incoming connection. To finish syncing with the leader, the follower needs to
synchronize on both QV_LOCK and the qcm object it owns; to finish setting up an incoming
connection, the listener thread needs to synchronize on both the qcm object the quorum peer
owns and the same QV_LOCK. The problem is that the two threads acquire the two locks in
different orders, so depending on timing / actual execution order, each thread might end up
holding one lock while waiting for the lock the other thread holds.
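
As an illustration only (this is not the ZooKeeper code and not a proposed patch), here is a
minimal Java sketch of the usual remedy for this kind of lock-order inversion: make every code
path take the two monitors in the same order. Judging from the stack traces above, the follower
path appears to take QV_LOCK and then the qcm monitor, while the listener path takes the qcm
monitor and then QV_LOCK. The class LockOrderSketch, the two plain Object locks, and the method
names are hypothetical stand-ins, assumed purely for the example.

```java
// Hypothetical sketch: two threads that both need two monitors, always taken in one order.
final class LockOrderSketch {
    // Stand-ins (assumptions) for QuorumPeer's QV_LOCK and the QuorumCnxManager monitor.
    static final Object QV_LOCK = new Object();
    static final Object QCM = new Object();
    static int sharedCounter = 0; // only touched while holding both locks

    // Models the follower's sync step: QV_LOCK first, then QCM.
    static void followerSyncStep() {
        synchronized (QV_LOCK) {
            synchronized (QCM) {
                sharedCounter++;
            }
        }
    }

    // Models the listener's connection setup. Taking QCM first here would recreate
    // the inversion described in the issue; using the same order avoids the cycle.
    static void listenerConnectStep() {
        synchronized (QV_LOCK) {
            synchronized (QCM) {
                sharedCounter++;
            }
        }
    }

    // Runs both paths concurrently; returns true if both threads finish in time
    // and every increment was counted.
    static boolean runBoth(int iterations) {
        synchronized (QV_LOCK) {
            synchronized (QCM) {
                sharedCounter = 0;
            }
        }
        Thread follower = new Thread(() -> {
            for (int i = 0; i < iterations; i++) followerSyncStep();
        }, "follower");
        Thread listener = new Thread(() -> {
            for (int i = 0; i < iterations; i++) listenerConnectStep();
        }, "listener");
        follower.setDaemon(true);
        listener.setDaemon(true);
        follower.start();
        listener.start();
        try {
            follower.join(10_000);
            listener.join(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (follower.isAlive() || listener.isAlive()) {
            return false; // a thread is still stuck after the timeout
        }
        synchronized (QV_LOCK) {
            synchronized (QCM) {
                return sharedCounter == 2 * iterations;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("both threads finished: " + runBoth(100_000));
    }
}
```

With a single global order there is no cycle in the wait-for graph, so neither thread can end up
waiting on a lock held by the other. Whether reordering is practical in QuorumPeer and
QuorumCnxManager, or whether one of the synchronized regions should instead be narrowed, is a
separate question that this sketch does not answer.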
