Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id C230D200C30 for ; Tue, 7 Mar 2017 22:10:44 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id C0A64160B68; Tue, 7 Mar 2017 21:10:44 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id E648F160B65 for ; Tue, 7 Mar 2017 22:10:43 +0100 (CET) Received: (qmail 60132 invoked by uid 500); 7 Mar 2017 21:10:43 -0000 Mailing-List: contact dev-help@zookeeper.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@zookeeper.apache.org Delivered-To: mailing list dev@zookeeper.apache.org Received: (qmail 60121 invoked by uid 99); 7 Mar 2017 21:10:43 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 07 Mar 2017 21:10:43 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 9B82CC0B33 for ; Tue, 7 Mar 2017 21:10:42 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.452 X-Spam-Level: * X-Spam-Status: No, score=1.452 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, RP_MATCHES_RCVD=-0.001, SPF_NEUTRAL=0.652, URIBL_BLOCKED=0.001] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id 8IedLCzHb91m for ; Tue, 7 Mar 2017 21:10:41 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id BA4325F1EE for ; Tue, 7 Mar 2017 21:10:40 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id B2E23E086A for ; Tue, 7 Mar 2017 21:10:38 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 1B85424176 for ; Tue, 7 Mar 2017 21:10:38 +0000 (UTC) Date: Tue, 7 Mar 2017 21:10:38 +0000 (UTC) From: "ASF GitHub Bot (JIRA)" To: dev@zookeeper.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (ZOOKEEPER-2711) Deadlock between concurrent 4LW commands that iterate over connections with Netty server MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Tue, 07 Mar 2017 21:10:44 -0000 [ https://issues.apache.org/jira/browse/ZOOKEEPER-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900170#comment-15900170 ] ASF GitHub Bot commented on ZOOKEEPER-2711: ------------------------------------------- GitHub user joshelser opened a pull request: https://github.com/apache/zookeeper/pull/186 ZOOKEEPER-2711 Avoid synchronization on NettyServerCnxn in Factory NettyServerCnxnFactory previously synchronized on the (Netty)ServerCnxn object to provide mutual exclusion at the RPC layer. However, this was at odds with the synchronized methods in ServerCnxn (which shared the same monitor). As such, it was possible to deadlock between concurrent 4LW commands that were invoking one of these synchronized methods on ServerCnxn. You can merge this pull request into a Git repository by running: $ git pull https://github.com/joshelser/zookeeper 2711-4lw Alternatively you can review and apply these changes as the patch at: https://github.com/apache/zookeeper/pull/186.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #186 ---- commit 9953e32963e6167f88b97519233b0872cce69a71 Author: Josh Elser Date: 2017-03-07T21:03:57Z ZOOKEEPER-2711 Avoid synchronization on NettyServerCnxn in Factory NettyServerCnxnFactory previously synchronized on the (Netty)ServerCnxn object to provide mutual exclusion at the RPC layer. However, this was at odds with the synchronized methods in ServerCnxn (which shared the same monitor). As such, it was possible to deadlock between concurrent 4LW commands that were invoking one of these synchronized methods on ServerCnxn. ---- > Deadlock between concurrent 4LW commands that iterate over connections with Netty server > ---------------------------------------------------------------------------------------- > > Key: ZOOKEEPER-2711 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2711 > Project: ZooKeeper > Issue Type: Bug > Reporter: Josh Elser > Priority: Critical > > Observed the following issue in some $dayjob testing environments. Line numbers are a little off compared to master/branch-3.5, but I did confirm the same issue exists there. > With the NettyServerCnxnFactory, before a request is dispatched, the code synchronizes on the {{NettyServerCnxn}} object. However, with some 4LW commands (like {{stat}}), each {{ServerCnxn}} object is also synchronized to (safely) iterate over the internal contents of the object to generate the necessary debug message. As such, multiple concurrent {{stat}} commands can both lock their own {{NettyServerCnxn}} objects, and then be blocked waiting to lock each others' {{ServerCnxn}} in the {{StatCommand}}, deadlocked. > {noformat} > "New I/O worker #55": > at org.apache.zookeeper.server.ServerCnxn.dumpConnectionInfo(ServerCnxn.java:407) > - waiting to lock <0x00000000fabc01b8> (a org.apache.zookeeper.server.NettyServerCnxn) > at org.apache.zookeeper.server.NettyServerCnxn$StatCommand.commandRun(NettyServerCnxn.java:478) > at org.apache.zookeeper.server.NettyServerCnxn$CommandThread.run(NettyServerCnxn.java:311) > at org.apache.zookeeper.server.NettyServerCnxn$CommandThread.start(NettyServerCnxn.java:306) > at org.apache.zookeeper.server.NettyServerCnxn.checkFourLetterWord(NettyServerCnxn.java:677) > at org.apache.zookeeper.server.NettyServerCnxn.receiveMessage(NettyServerCnxn.java:790) > at org.apache.zookeeper.server.NettyServerCnxnFactory$CnxnChannelHandler.processMessage(NettyServerCnxnFactory.java:211) > at org.apache.zookeeper.server.NettyServerCnxnFactory$CnxnChannelHandler.messageReceived(NettyServerCnxnFactory.java:135) > - locked <0x00000000fab68178> (a org.apache.zookeeper.server.NettyServerCnxn) > at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88) > at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) > at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559) > at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) > at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) > at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) > at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109) > at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312) > at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90) > at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) > at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) > at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > "New I/O worker #51": > at org.apache.zookeeper.server.ServerCnxn.dumpConnectionInfo(ServerCnxn.java:407) > - waiting to lock <0x00000000fab68178> (a org.apache.zookeeper.server.NettyServerCnxn) > at org.apache.zookeeper.server.NettyServerCnxn$StatCommand.commandRun(NettyServerCnxn.java:478) > at org.apache.zookeeper.server.NettyServerCnxn$CommandThread.run(NettyServerCnxn.java:311) > at org.apache.zookeeper.server.NettyServerCnxn$CommandThread.start(NettyServerCnxn.java:306) > at org.apache.zookeeper.server.NettyServerCnxn.checkFourLetterWord(NettyServerCnxn.java:677) > at org.apache.zookeeper.server.NettyServerCnxn.receiveMessage(NettyServerCnxn.java:790) > at org.apache.zookeeper.server.NettyServerCnxnFactory$CnxnChannelHandler.processMessage(NettyServerCnxnFactory.java:211) > at org.apache.zookeeper.server.NettyServerCnxnFactory$CnxnChannelHandler.messageReceived(NettyServerCnxnFactory.java:135) > - locked <0x00000000fabc01b8> (a org.apache.zookeeper.server.NettyServerCnxn) > at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88) > at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) > at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559) > at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) > at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) > at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) > at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109) > at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312) > at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90) > at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) > at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) > at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > It would appear that the synchronization on the {{NettyServerCnxn}} in {{NettyServerCnxnFactory}} is to blame (and I can see why it was done originally). I think we can just use a different Object (and monitor) to provide mutual exclusion at Netty layer (and avoid synchronization issues at the "application" layer). -- This message was sent by Atlassian JIRA (v6.3.15#6346)