cassandra-commits mailing list archives

From "Oliver Seiler (JIRA)"
Subject [jira] [Created] (CASSANDRA-6899) Don't include time to read a message in determining whether to drop message
Date Thu, 20 Mar 2014 20:13:45 GMT
Oliver Seiler created CASSANDRA-6899:

             Summary: Don't include time to read a message in determining whether to drop message
                 Key: CASSANDRA-6899
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Oliver Seiler
            Priority: Minor

This came out of trying to understand why I was seeing a large number of dropped (mutation)
messages on an otherwise quiet test cluster that had previously been run in a way that left
a large number of hints queued from nodes in DC 1 to nodes in DC 2. The cluster version is
Cassandra 2.0.4, 3 nodes in DC 1, 3 nodes in DC 2, with RF=3 for each DC. I think it's relevant
to mention that we've enabled the inter_dc_tcp_nodelay setting.

Virtually no debug logging is done for dropped messages, so I had to dig down into the source
to try to figure out what is going on. It appears the message is large enough that, combined
with our having enabled inter_dc_tcp_nodelay, the time taken to read the message
from the socket exceeds the default 2-second write_request_timeout_in_ms setting used to determine
whether to drop mutation messages. Note that we don't see any dropped messages in DC 1, which
is why I believe this is related to inter_dc_tcp_nodelay; because this is a test cluster,
the two DCs are actually on the same network (1GigE).

The specific issue I'm raising here is in,
which obtains a timestamp before reading the message payload. This doesn't seem useful, since
at the point the message would get dropped (MessageDeliveryTask::run) we've already read the
message, queued it to the MutationStage thread pool via MessageDeliveryTask, and have MessageDeliveryTask
running. It isn't clear to me why we'd want to include the time to read the message off the
wire to determine whether the thread pool is backlogging, since in this case the thread pool
*isn't* backlogging at all. In fact, once in this state, not much is going to allow the message
to get processed (short of a configuration change), resulting in the message being re-sent
every ten minutes; in this case a 'nodetool repair' was required to clear out the hints.
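
To make the timing argument concrete, here is a minimal sketch of the drop decision as described above. It is not Cassandra's actual code: the class name, the shouldDrop helper, and the millisecond values are all hypothetical, and the comparison is an assumed simplification of the check made when the delivery task runs. It only illustrates how the choice of timestamp (taken before versus after reading the payload) changes the outcome when the read itself is slow and the thread pool is idle.

```java
// Hypothetical sketch; names and values are illustrative, not Cassandra's actual classes.
final class DropTimingSketch {
    // Mirrors the 2-second default for write_request_timeout_in_ms mentioned above.
    static final long WRITE_TIMEOUT_MS = 2000;

    /** Returns true if a message stamped at startMs would be dropped when checked at nowMs. */
    static boolean shouldDrop(long startMs, long nowMs, long timeoutMs) {
        return nowMs - startMs > timeoutMs;
    }

    public static void main(String[] args) {
        long readStartMs = 0;    // timestamp taken before reading the payload
        long readEndMs = 2500;   // payload fully read (assumed slow 2.5 s read)
        long taskRunMs = 2505;   // delivery task runs 5 ms later, i.e. no backlog

        // Stamped before the read: the read time alone exceeds the timeout.
        System.out.println(shouldDrop(readStartMs, taskRunMs, WRITE_TIMEOUT_MS)); // true

        // Stamped after the read: only the 5 ms of queueing delay is measured.
        System.out.println(shouldDrop(readEndMs, taskRunMs, WRITE_TIMEOUT_MS));   // false
    }
}
```

With the timestamp taken before the read, the message is dropped even though it waited only 5 ms in the queue; with the timestamp taken after the read, the same message is processed.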

Am I missing something in this? It seems intentional in IncomingTcpConnection, given the way
that the cross_node_timeout setting is used, and clearly we shouldn't be generating large
messages like this, but it doesn't seem useful to have logic that results in messages being
dropped when there isn't actually any load-related reason for doing so.
