cassandra-commits mailing list archives

From "Ariel Weisberg (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
Date Sat, 25 Apr 2015 15:02:39 GMT


Ariel Weisberg commented on CASSANDRA-8789:

I ran exactly what you suggested, except I routed gossip on the large message socket and set
the large message threshold to Integer.MAX_VALUE.

getConnection() looked like:
    /**
     * returns the appropriate connection based on message type.
     * returns null if a connection could not be established.
     */
    OutboundTcpConnection getConnection(MessageOut msg)
    {
        // Gossip is routed to the large message socket for this experiment.
        if (msg.getStage() == Stage.GOSSIP)
            return largeMessages;
        return msg.payloadSize(smallMessages.getTargetVersion()) > LARGE_MESSAGE_THRESHOLD
               ? largeMessages
               : smallMessages;
    }

And it fails in exactly the same way. The fact that you have to pull in the coalescing fixes
to get it to fail further confirms my belief that messaging got faster (when there are no
network issues), not slower, and that this is what leads to the node hanging. 2.0 doesn't log
pending tasks in each stage, so I would have to add more instrumentation to confirm this is the issue.

To test that thesis further, I cherry-picked only 144644bbf77a546c45db384e2dbc18e13f65c9ce
and it ran 10 million writes with no problem. That doesn't mean there isn't a head-of-line
blocking issue when network connections are genuinely slow. That's why I created CASSANDRA-9237,
and I have a couple of ideas for making FD less dependent on heartbeats or for keeping gossip
messages from being blocked.

Taking it one step further, I added back coalescing, but not the full deal. I just fixed
a bug in OutboundTcpConnection where it would never write multiple messages at once without
diff --git a/src/java/org/apache/cassandra/net/ b/src/java/org/apache/cassandra/net/
index cddce07..e90cef8 100644
--- a/src/java/org/apache/cassandra/net/
+++ b/src/java/org/apache/cassandra/net/
@@ -132,7 +132,7 @@ public class OutboundTcpConnection extends Thread
         while (true)
-            if (backlog.drainTo(drainedMessages, drainedMessages.size()) == 0)
+            if (backlog.drainTo(drainedMessages, 128) == 0)
@@ -142,7 +142,7 @@ public class OutboundTcpConnection extends Thread
                     throw new AssertionError(e);
+                backlog.drainTo(drainedMessages, 127);
             currentMsgBufferCount = drainedMessages.size();
With this change it fails.
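
For readers following the diff above, here is a minimal, self-contained sketch of why the original drainTo() call would never batch. It assumes drainedMessages is empty at the top of each loop iteration (the diff does not show this), in which case drainedMessages.size() is 0 and BlockingQueue.drainTo(collection, 0) transfers nothing. The class name, method names and String messages are hypothetical, and the placement of the second drainTo() right after take() is a guess based on the hunk context. This is not the Cassandra code itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class; only the drainTo()/take() pattern comes from the diff.
public class DrainToDemo
{
    // Pre-patch shape: the drain limit is the size of the (assumed empty) target list.
    public static int batchSizeBefore(BlockingQueue<String> backlog)
    {
        List<String> drained = new ArrayList<>(128);
        // drained.size() is 0 here, so drainTo() moves nothing and returns 0.
        if (backlog.drainTo(drained, drained.size()) == 0)
        {
            try
            {
                drained.add(backlog.take()); // falls back to one message at a time
            }
            catch (InterruptedException e)
            {
                throw new AssertionError(e);
            }
        }
        return drained.size();
    }

    // Post-patch shape: drain up to 128, or block for one and then drain up to 127 more.
    public static int batchSizeAfter(BlockingQueue<String> backlog)
    {
        List<String> drained = new ArrayList<>(128);
        if (backlog.drainTo(drained, 128) == 0)
        {
            try
            {
                drained.add(backlog.take());
            }
            catch (InterruptedException e)
            {
                throw new AssertionError(e);
            }
            backlog.drainTo(drained, 127);
        }
        return drained.size();
    }

    // Builds a backlog holding n placeholder messages.
    public static BlockingQueue<String> backlogOf(int n)
    {
        BlockingQueue<String> q = new LinkedBlockingQueue<>();
        for (int i = 0; i < n; i++)
            q.add("msg-" + i);
        return q;
    }

    public static void main(String[] args)
    {
        System.out.println(batchSizeBefore(backlogOf(10))); // prints 1
        System.out.println(batchSizeAfter(backlogOf(10)));  // prints 10
    }
}
```

Under that assumption, the pre-patch loop hands the socket one message per iteration no matter how deep the backlog is, while the patched loop writes up to 128 at once.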

Fundamentally, the changes in this ticket are, as Benedict pointed out, not completely new.
Gossip always contended with mutation responses and read responses. The big change is that
"small" mutations share a socket with gossip messages.

> OutboundTcpConnectionPool should route messages to sockets by size not type
> ---------------------------------------------------------------------------
>                 Key: CASSANDRA-8789
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 3.0
>         Attachments: 8789.diff
> I was looking at this trying to understand what messages flow over which connection.
> For reads the request goes out over the command connection and the response comes back
> over the ack connection.
> For writes the request goes out over the command connection and the response comes back
> over the command connection.
> Reads get a dedicated socket for responses. Mutation commands and responses both travel
> over the same socket along with read requests.
> Sockets are used unidirectionally, so there are actually four sockets in play and four
> threads at each node (2 inbound, 2 outbound).
> CASSANDRA-488 doesn't leave a record of what the impact of this change was. If someone
> remembers what situations were made better it would be good to know.
> I am not clear on when/how this is helpful. The consumer side shouldn't be blocking, so
> the only head-of-line blocking issue is the time it takes to transfer data over the wire.
> If message size is the cause of blocking issues, then the current design mixes small messages
> and large messages on the same connection, retaining the head-of-line blocking.
> Read requests share the same connection as write requests (which are large), and write
> acknowledgments (which are small) share the same connections as write requests. The only winner
> is read acknowledgments.

This message was sent by Atlassian JIRA
