Date: Thu, 9 Feb 2017 18:53:42 +0000 (UTC)
From: "Michael Shuler (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Updated] (CASSANDRA-13204) Thread Leak in OutboundTcpConnection

     [ https://issues.apache.org/jira/browse/CASSANDRA-13204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Shuler updated CASSANDRA-13204:
---------------------------------------
    Fix Version/s: 3.11.x
                   2.2.x
                   2.1.x
                   3.0.11

> Thread Leak in OutboundTcpConnection
> ------------------------------------
>
>                 Key: CASSANDRA-13204
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13204
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: sankalp kohli
>            Assignee: Jason Brown
>             Fix For: 3.0.11, 2.1.x, 2.2.x, 3.11.x
>
>
> We found threads leaking from OutboundTcpConnection to machines that are not part of the cluster but are still in Gossip for some reason. There are two issues here; this JIRA covers the second one, which is the most important.
> 1) The first issue is that Gossip has information about machines that are not in the ring because they have been replaced. This causes Cassandra to try to connect to those machines, but due to internode auth it won't be able to connect to them at the socket level.
> 2) The second issue is a race between creating a connection and closing a connection, which is triggered by the gossip bug explained above. Let me try to explain it using the code.
> In OutboundTcpConnection, we call closeSocket(true), which sets isStopped=true and also puts a close sentinel into the queue to exit the thread. On the ack connection, Gossip tries to send a message, which calls connect(); that call blocks for 10 seconds, which is the RPC timeout. It blocks because Cassandra might not be running there or internode auth will not let it connect. If Gossip calls closeSocket during these 10 seconds, it puts the close sentinel into the queue. When we return from the connect method after 10 seconds, we clear the backlog queue, which discards the sentinel along with everything else, so the thread never sees it and leaks.
> Proof from the heap dump of the affected machine, which is leaking threads:
> 1. Only the ack connection is leaking, not the command connection, which is not used by Gossip.
> 2. We see threads blocked on the backlog queue with isStopped=true and an empty backlog queue. These are the threads that have already leaked.
> 3. A running thread was blocked in connect waiting for the timeout (10 seconds), and we see that its backlog queue contains the close sentinel. Once connect returns false, we will clear the backlog and this thread will have leaked.
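
The race described above can be reproduced in miniature. The following is a minimal sketch, not the Cassandra code: the class name SentinelRaceSketch, the method writerExits, the string sentinel and the latches are all hypothetical stand-ins, and the 10 second connect timeout is replaced by a latch so the example finishes quickly. It only illustrates how clearing a queue after a failed connect can discard a close sentinel that arrived in the meantime.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

public class SentinelRaceSketch {
    // Hypothetical stand-in for the close sentinel message.
    static final String CLOSE_SENTINEL = "CLOSE";

    /**
     * Runs one simulated writer thread and reports whether it exited.
     * clearOnFailedConnect = true mimics the behaviour described in the report.
     */
    static boolean writerExits(boolean clearOnFailedConnect) {
        BlockingQueue<String> backlog = new LinkedBlockingQueue<>();
        CountDownLatch connecting = new CountDownLatch(1);
        CountDownLatch sentinelQueued = new CountDownLatch(1);

        Thread writer = new Thread(() -> {
            try {
                // Simulated connect(): blocks until the close request has been queued,
                // then fails, as it would against an unreachable peer.
                connecting.countDown();
                sentinelQueued.await();
                boolean connected = false;
                if (!connected && clearOnFailedConnect) {
                    backlog.clear(); // discards the close sentinel as well
                }
                // Main loop: wait for the next message and exit on the sentinel.
                while (true) {
                    String message = backlog.take();
                    if (CLOSE_SENTINEL.equals(message)) {
                        return;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "writer-sketch");
        writer.setDaemon(true); // a leaked thread must not keep the JVM alive
        writer.start();

        try {
            connecting.await();
            backlog.add(CLOSE_SENTINEL); // simulated closeSocket(true) during connect
            sentinelQueued.countDown();
            writer.join(1000);           // give the writer time to exit if it can
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !writer.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("backlog cleared: exited=" + writerExits(true));
        System.out.println("sentinel kept:   exited=" + writerExits(false));
    }
}
```

With the backlog cleared, the simulated writer stays parked in take() on an empty queue, which matches the "already leaked" state in proof 2. Keeping the sentinel is shown only as a contrast case and is not necessarily how the issue was resolved.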
> Interesting bits from jstack
> 1282 number of threads for "MessagingService-Outgoing-/"
> Thread which is about to leak:
> "MessagingService-Outgoing-/"
>    java.lang.Thread.State: RUNNABLE
>    at sun.nio.ch.Net.connect0(Native Method)
>    at sun.nio.ch.Net.connect(Net.java:454)
>    at sun.nio.ch.Net.connect(Net.java:446)
>    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
>    - locked <> (a java.lang.Object)
>    - locked <> (a java.lang.Object)
>    - locked <> (a java.lang.Object)
>    at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:137)
>    at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:119)
>    at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:381)
>    at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:217)
> Thread already leaked:
> "MessagingService-Outgoing-/"
>    java.lang.Thread.State: WAITING (parking)
>    at sun.misc.Unsafe.park(Native Method)
>    - parking to wait for <> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>    at org.apache.cassandra.utils.CoalescingStrategies$DisabledCoalescingStrategy.coalesceInternal(CoalescingStrategies.java:482)
>    at org.apache.cassandra.utils.CoalescingStrategies$CoalescingStrategy.coalesce(CoalescingStrategies.java:213)
>    at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:190)

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)