cassandra-commits mailing list archives

From "Christian Esken (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CASSANDRA-13265) Communication breakdown in OutboundTcpConnection
Date Fri, 24 Feb 2017 16:17:44 GMT
Christian Esken created CASSANDRA-13265:
-------------------------------------------

             Summary: Communication breakdown in OutboundTcpConnection
                 Key: CASSANDRA-13265
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
             Project: Cassandra
          Issue Type: Bug
         Environment: Cassandra 3.0.9
Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 1.8.0_112-b15)
Linux 3.16
            Reporter: Christian Esken


I observed that sometimes a single node in a Cassandra cluster fails to communicate with the
other nodes. This can happen at any time, during peak load or low load. Restarting that single
node fixes the issue.

Before going into details, I want to state that I have analyzed the situation and am already
developing a possible fix. Here is the analysis so far:

- A thread dump taken in this situation showed 324 threads in the OutboundTcpConnection class
waiting to lock the backlog queue in order to expire messages.
- A class histogram shows 262508 instances of OutboundTcpConnection$QueuedMessage.

What is the effect? Once the Cassandra node has reached that state, it never recovers by
itself. Instead it thrashes itself to death, because each of these threads fully locks the
queue for reading and writing every time it calls iterator.next().
- Writing: Only after 262508 locking operations can a thread progress with actually writing to
the queue.
- Reading: Also blocked, as 324 threads call iterator.next() and each call fully locks the
queue.
This means that writing blocks the queue for reading, and readers might even be starved, which
makes the situation even worse.
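
The locking pattern described above can be illustrated with a minimal sketch. It assumes the
backlog is a java.util.concurrent.LinkedBlockingQueue, whose iterator takes both the put lock
and the take lock on every next() and remove() call. The class BacklogExpirationSketch, the
simplified QueuedMessage type and the method expireMessages are hypothetical stand-ins for
illustration only. They are not the actual OutboundTcpConnection code and not the fix being
developed.

```java
import java.util.Iterator;
import java.util.concurrent.LinkedBlockingQueue;

class BacklogExpirationSketch {
    // Hypothetical, simplified stand-in for OutboundTcpConnection$QueuedMessage.
    static final class QueuedMessage {
        final long expiresAtNanos;

        QueuedMessage(long expiresAtNanos) {
            this.expiresAtNanos = expiresAtNanos;
        }

        boolean isExpired(long nowNanos) {
            return nowNanos - expiresAtNanos >= 0;
        }
    }

    // One expiration pass over the backlog. Every it.next() and it.remove()
    // acquires both internal locks of the LinkedBlockingQueue, so a pass over
    // n entries costs at least n full-lock acquisitions, during which no other
    // thread can put or take.
    static int expireMessages(LinkedBlockingQueue<QueuedMessage> backlog, long nowNanos) {
        int dropped = 0;
        Iterator<QueuedMessage> it = backlog.iterator();
        while (it.hasNext()) {
            QueuedMessage message = it.next();
            if (message.isExpired(nowNanos)) {
                it.remove();
                dropped++;
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<QueuedMessage> backlog = new LinkedBlockingQueue<>();
        backlog.add(new QueuedMessage(0L));    // already expired at now = 50
        backlog.add(new QueuedMessage(100L));  // still valid at now = 50
        int dropped = expireMessages(backlog, 50L);
        System.out.println("dropped=" + dropped + " remaining=" + backlog.size());
    }
}
```

If each of the 324 threads ran such a pass over all 262508 queued messages, that would amount
to 324 * 262508 = 85052592 full-lock acquisitions, all competing with the threads that want to
write to or read from the queue.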

-----
The setup is:
 - 3-node cluster
 - replication factor 2
 - Consistency LOCAL_ONE
 - No remote DCs
 - high write throughput (100000 INSERT statements per second and more during peak times).
 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
