Date: Sat, 6 May 2017 05:34:04 +0000 (UTC)
From: "Michael Han (JIRA)"
To: dev@zookeeper.apache.org
Reply-To: dev@zookeeper.apache.org
Subject: [jira] [Updated] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Han updated ZOOKEEPER-2778:
-----------------------------------
    Description: 
It's possible to have a deadlock during the recovery phase.

Found this issue by analyzing thread dumps of the "flaky" ReconfigRecoveryTest [1].
Here is a sample thread dump that illustrates the state of the execution:

{noformat}
    [junit] java.lang.Thread.State: BLOCKED
    [junit]         at org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
    [junit]         at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
    [junit]         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
    [junit]         at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
    [junit]         at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
    [junit]
    [junit] java.lang.Thread.State: BLOCKED
    [junit]         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
    [junit]         at org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
    [junit]         at org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
    [junit]         at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
    [junit]         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
    [junit]         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
{noformat}

The deadlock happens between the quorum peer thread that runs the follower's sync-with-leader work and the listener thread of that peer's QuorumCnxManager, which accepts incoming connections. To finish syncing with the leader, the follower thread first synchronizes on QV_LOCK (in setLastSeenQuorumVerifier) and then needs the monitor of the QuorumCnxManager it owns (in connectOne). To finish setting up an incoming connection, the listener thread first synchronizes on that same QuorumCnxManager object (in connectOne) and then needs the same QV_LOCK (in getElectionAddress). The problem is that the two threads acquire the two locks in opposite orders, so depending on timing / the actual execution order, each thread can end up holding one lock while waiting for the other.

[1] org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig
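For illustration, here is a minimal, self-contained sketch of the lock-order inversion described above. It is not ZooKeeper code: qvLock, cnxLock, followerSyncPath and listenerPath are made-up stand-ins for QV_LOCK, the QuorumCnxManager monitor, and the two code paths visible in the stacks. Run as-is, the two threads normally end up permanently blocked, each holding one lock and waiting on the other:

{code:java}
// Minimal sketch of the lock-order inversion described above.
// qvLock / cnxLock / followerSyncPath / listenerPath are made-up stand-ins for
// QV_LOCK, the QuorumCnxManager monitor, and the two code paths in the stacks.
public class LockOrderInversionDemo {

    private final Object qvLock  = new Object(); // plays the role of QV_LOCK
    private final Object cnxLock = new Object(); // plays the role of the QuorumCnxManager monitor

    // Mirrors the follower path: setLastSeenQuorumVerifier -> connectNewPeers -> connectOne
    void followerSyncPath() {
        synchronized (qvLock) {          // QV_LOCK taken first
            pause();                     // widen the race window so the deadlock shows up reliably
            synchronized (cnxLock) {     // ... then the connection-manager monitor
                // reconnect to peers for the new quorum verifier
            }
        }
    }

    // Mirrors the listener path: receiveConnection -> connectOne -> initiateConnection -> getElectionAddress
    void listenerPath() {
        synchronized (cnxLock) {         // connection-manager monitor taken first
            pause();
            synchronized (qvLock) {      // ... then QV_LOCK, i.e. the reverse order
                // look up the peer's election address
            }
        }
    }

    private static void pause() {
        try { Thread.sleep(200); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) throws InterruptedException {
        LockOrderInversionDemo demo = new LockOrderInversionDemo();
        Thread follower = new Thread(demo::followerSyncPath, "follower-sync");
        Thread listener = new Thread(demo::listenerPath, "cnx-listener");
        follower.start();
        listener.start();
        follower.join(2000);
        listener.join(2000);
        // With the pauses above, both threads are normally still BLOCKED here,
        // each holding one lock and waiting on the other -- the same state as the dump.
        System.out.println("follower alive: " + follower.isAlive()
                + ", listener alive: " + listener.isAlive());
    }
}
{code}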
  was:
It's possible to have a deadlock during the recovery phase.

Found this issue by analyzing thread dumps of the "flaky" ReconfigRecoveryTest.

Here is a sample thread dump that illustrates the state of the execution:

{noformat}
    [junit] java.lang.Thread.State: BLOCKED
    [junit]         at org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
    [junit]         at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
    [junit]         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
    [junit]         at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
    [junit]         at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
    [junit]
    [junit] java.lang.Thread.State: BLOCKED
    [junit]         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
    [junit]         at org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
    [junit]         at org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
    [junit]         at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
    [junit]         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
    [junit]         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
{noformat}

The deadlock happens between the quorum peer thread that runs the follower's sync-with-leader work and the listener thread of that peer's QuorumCnxManager, which accepts incoming connections. To finish syncing with the leader, the follower thread first synchronizes on QV_LOCK and then needs the monitor of the QuorumCnxManager it owns; to finish setting up an incoming connection, the listener thread first synchronizes on that same QuorumCnxManager object and then needs the same QV_LOCK. The problem is that the two threads acquire the two locks in opposite orders, so depending on timing / the actual execution order, each thread can end up holding one lock while waiting for the other.
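As a side note, here is a generic sketch of the usual remedy for this class of bug, shown only to make the lock-ordering point concrete: if every path takes the two locks in the same global order, the cycle above cannot form. The names are again made-up stand-ins, and this is an illustration of the general technique, not the actual patch for ZOOKEEPER-2778.

{code:java}
// Generic illustration only: both paths agree on a single acquisition order
// (stand-in qvLock before stand-in cnxLock), so neither thread can hold one
// lock while waiting for the other.
public class ConsistentLockOrderDemo {

    private final Object qvLock  = new Object();
    private final Object cnxLock = new Object();

    void followerSyncPath() {
        synchronized (qvLock) {
            synchronized (cnxLock) {
                // reconnect to peers for the new quorum verifier
            }
        }
    }

    void listenerPath() {
        synchronized (qvLock) {          // same order as followerSyncPath
            synchronized (cnxLock) {
                // set up the incoming connection
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ConsistentLockOrderDemo demo = new ConsistentLockOrderDemo();
        Thread a = new Thread(demo::followerSyncPath, "follower-sync");
        Thread b = new Thread(demo::listenerPath, "cnx-listener");
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println("both threads finished -- no deadlock");
    }
}
{code}

An equivalent alternative is to avoid calling into the connection manager while holding QV_LOCK at all, for example by copying whatever the quorum-verifier lock protects into a local variable before connecting; which shape fits ZooKeeper best is a separate question from this report.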
> Potential server deadlock between follower sync with leader and follower receiving external connection requests.
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2778
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.5.3
>            Reporter: Michael Han
>            Assignee: Michael Han
>            Priority: Critical
>
> It's possible to have a deadlock during the recovery phase.
> Found this issue by analyzing thread dumps of the "flaky" ReconfigRecoveryTest [1].
> Here is a sample thread dump that illustrates the state of the execution:
> {noformat}
>     [junit] java.lang.Thread.State: BLOCKED
>     [junit]         at org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
>     [junit]         at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
>     [junit]         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
>     [junit]         at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
>     [junit]         at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
>     [junit]
>     [junit] java.lang.Thread.State: BLOCKED
>     [junit]         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
>     [junit]         at org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
>     [junit]         at org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
>     [junit]         at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
>     [junit]         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
>     [junit]         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread that runs the follower's sync-with-leader work and the listener thread of that peer's QuorumCnxManager, which accepts incoming connections. To finish syncing with the leader, the follower thread first synchronizes on QV_LOCK and then needs the monitor of the QuorumCnxManager it owns; to finish setting up an incoming connection, the listener thread first synchronizes on that same QuorumCnxManager object and then needs the same QV_LOCK. The problem is that the two threads acquire the two locks in opposite orders, so depending on timing / the actual execution order, each thread can end up holding one lock while waiting for the other.
> [1] org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)