cassandra-commits mailing list archives

From "Ariel Weisberg (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-13204) Thread Leak in OutboundTcpConnection
Date Fri, 10 Feb 2017 19:33:41 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-13204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861728#comment-15861728 ]

Ariel Weisberg commented on CASSANDRA-13204:
--------------------------------------------

+1

> Thread Leak in OutboundTcpConnection
> ------------------------------------
>
>                 Key: CASSANDRA-13204
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13204
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: sankalp kohli
>            Assignee: Jason Brown
>             Fix For: 3.0.11, 2.1.x, 2.2.x, 3.11.x
>
>
> We found threads leaking from OutboundTcpConnection to machines that are not part of
the cluster but are still in Gossip for some reason. There are two issues here; this JIRA
covers the second one, which is the more important.
> 1) The first issue is that Gossip has information about machines that are no longer in the
ring because they have been replaced. This causes Cassandra to try to connect to those machines,
but due to internode auth it won't be able to connect to them at the socket level.
> 2) The second issue is a race between creating a connection and closing a connection, which
is triggered by the gossip bug explained above. Let me try to explain it using the code.
> In OutboundTcpConnection, we call closeSocket(true), which sets isStopped=true
and also puts a close sentinel into the queue to exit the thread. On the ack connection, Gossip
tries to send a message, which calls connect(), which blocks for 10 seconds (the RPC
timeout). It blocks because Cassandra might not be running there or internode
auth will not let it connect. If Gossip calls closeSocket during these 10 seconds, it will
put the close sentinel into the queue. When we return from the connect method after 10 seconds,
we clear the backlog queue, which discards the close sentinel and causes this thread to leak.
> Evidence from the heap dump of the affected machine, which is leaking threads:
> 1. Only the ack connection is leaking, not the command connection, which is not used by
Gossip.
> 2. We see threads blocked on the backlog queue with isStopped=true and an empty backlog queue.
This is the state of the threads that have already leaked.
> 3. A running thread was blocked in connect, waiting for the timeout (10 seconds), and we
see that the backlog queue contains the close sentinel. Once connect returns false, we will
clear the backlog and this thread will have leaked.
> Interesting bits from jstack:
> 1282 threads for "MessagingService-Outgoing-/<IP-Address>"
> Thread which is about to leak:
> "MessagingService-Outgoing-/<IP Address>" 
>    java.lang.Thread.State: RUNNABLE
> 	at sun.nio.ch.Net.connect0(Native Method)
> 	at sun.nio.ch.Net.connect(Net.java:454)
> 	at sun.nio.ch.Net.connect(Net.java:446)
> 	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
> 	- locked <> (a java.lang.Object)
> 	- locked <> (a java.lang.Object)
> 	- locked <> (a java.lang.Object)
> 	at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:137)
> 	at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:119)
> 	at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:381)
> 	at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:217)
> Thread already leaked:
> "MessagingService-Outgoing-/<IP Address>"
>    java.lang.Thread.State: WAITING (parking)
> 	at sun.misc.Unsafe.park(Native Method)
> 	- parking to wait for  <> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> 	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 	at org.apache.cassandra.utils.CoalescingStrategies$DisabledCoalescingStrategy.coalesceInternal(CoalescingStrategies.java:482)
> 	at org.apache.cassandra.utils.CoalescingStrategies$CoalescingStrategy.coalesce(CoalescingStrategies.java:213)
> 	at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:190)
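
The race described in the issue above can be reproduced in isolation. The following is a minimal sketch, not the Cassandra code: the class SentinelRaceSketch, the latches standing in for the blocking connect(), and the checkStoppedAfterFailedConnect guard are all hypothetical names introduced for illustration. The guard shows one possible way to avoid the leak and is not necessarily the patch applied for this ticket.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

public class SentinelRaceSketch {
    // Marker object that tells the connection thread to exit (hypothetical stand-in).
    static final Object CLOSE_SENTINEL = new Object();

    static class Connection extends Thread {
        final LinkedBlockingQueue<Object> backlog = new LinkedBlockingQueue<>();
        final CountDownLatch connectStarted = new CountDownLatch(1);
        final CountDownLatch connectMayFail = new CountDownLatch(1);
        final boolean checkStoppedAfterFailedConnect; // assumed guard, for illustration only
        volatile boolean isStopped;

        Connection(boolean checkStoppedAfterFailedConnect) {
            this.checkStoppedAfterFailedConnect = checkStoppedAfterFailedConnect;
            setDaemon(true); // a leaked thread must not keep the JVM alive
        }

        // Mirrors the described behaviour: set the flag and enqueue the sentinel.
        void closeSocket() {
            isStopped = true;
            backlog.add(CLOSE_SENTINEL);
        }

        // Stands in for the connect that blocks until the timeout and then fails.
        boolean connect() throws InterruptedException {
            connectStarted.countDown();
            connectMayFail.await();
            return false;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    Object message = backlog.take();
                    if (message == CLOSE_SENTINEL) return;
                    if (!connect()) {
                        if (checkStoppedAfterFailedConnect && isStopped) return;
                        // Clearing the backlog also discards a sentinel queued during connect().
                        backlog.clear();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Returns true if the connection thread is still alive after being closed.
    static boolean leaks(boolean guard) throws InterruptedException {
        Connection c = new Connection(guard);
        c.start();
        c.backlog.add("gossip message");  // triggers connect()
        c.connectStarted.await();         // thread is now blocked inside connect()
        c.closeSocket();                  // sentinel is enqueued during the connect
        c.connectMayFail.countDown();     // connect "times out" and returns false
        c.join(500);
        return c.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("without guard, thread leaked: " + leaks(false)); // true
        System.out.println("with guard, thread leaked: " + leaks(true));     // false
    }
}
```

Without the guard, the thread ends up parked in LinkedBlockingQueue.take() with isStopped=true and an empty backlog, which matches the "Thread already leaked" stack above.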



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
