cassandra-commits mailing list archives

From "Oliver Seiler (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-6899) Don't include time to read a message in determining whether to drop message
Date Fri, 21 Mar 2014 05:24:42 GMT


Oliver Seiler commented on CASSANDRA-6899:

This was for a hint to a second DC; the client submitted the batch at LOCAL_QUORUM and had
no errors. My problem here was that the hint mutation messages were being dropped on a cluster
without any other load.

I agree that it would be better to have smaller messages (and am dealing with that issue).

> Don't include time to read a message in determining whether to drop message
> ---------------------------------------------------------------------------
>                 Key: CASSANDRA-6899
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Oliver Seiler
>            Priority: Minor
> This came out of trying to understand why I was seeing a large number of dropped (mutation)
messages on an otherwise quiet test cluster that had previously been run such that there were
a large number of queued hints from nodes in DC 1 to nodes in DC 2. The cluster version is
Cassandra 2.0.4, 3 nodes in DC 1, 3 nodes in DC 2, with RF=3 for each DC. I think it's relevant
to mention that we've enabled the inter_dc_tcp_nodelay setting.
> Virtually no debug logging is done for dropped messages, so I had to dig down into the
source to try to figure out what is going on. It appears the message is large enough that,
combined with our enabling of inter_dc_tcp_nodelay, the time taken to read the
message from the socket exceeds the default 2 second write_request_timeout_in_ms setting used
to determine whether to drop mutation messages. Note that we don't see any dropped messages
in DC 1, which is why I believe this is related to inter_dc_tcp_nodelay; because this is a
test cluster, the two DCs are actually on the same network (1GigE).
> The specific issue I'm raising here is in,
which obtains a timestamp before reading the message payload. This doesn't seem useful, since
at the point the message would get dropped (MessageDeliveryTask::run) we've already read the
message, queued it to the MutationStage thread pool via MessageDeliveryTask, and have MessageDeliveryTask
running. It isn't clear to me why we'd want to include the time to read the message off the
wire to determine whether the thread pool is backlogging, since in this case the thread pool
*isn't* backlogging at all. In fact, once in this state, not much is going to allow the message
to get processed (short of a configuration change), resulting in the message being re-sent
every ten minutes; in this case a 'nodetool repair' was required to clear out the hints.
> Am I missing something in this? It seems intentional in IncomingTcpConnection, given
the way that the cross_node_timeout setting is used, and clearly we shouldn't be generating
large messages like this, but it doesn't seem useful to have logic that results in messages
being dropped when there isn't actually any load-related reason for doing so.
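
To make the timing argument above concrete, here is a minimal sketch of the drop decision. It is not the Cassandra source: the class name, method name, and the 2500 ms read time and 5 ms queue time are all hypothetical values chosen for illustration. Only the 2 second timeout comes from the report. It compares stamping the message before the payload is read with stamping it after.

```java
public class DropDecisionSketch {
    // Default write_request_timeout_in_ms cited in the report (2 seconds).
    static final long WRITE_TIMEOUT_MS = 2000;

    // Hypothetical helper: true if the message has aged past the timeout
    // by the time its delivery task starts running.
    static boolean shouldDrop(long stampMs, long taskStartMs, long timeoutMs) {
        return taskStartMs - stampMs > timeoutMs;
    }

    public static void main(String[] args) {
        long arrivalMs = 0;   // first bytes of the message arrive
        long readMs = 2500;   // assumed time to read a large payload off the socket
        long queueMs = 5;     // assumed wait in an otherwise idle stage
        long taskStartMs = arrivalMs + readMs + queueMs;

        // Stamp taken before the payload is read: read time counts against the timeout.
        boolean dropBefore = shouldDrop(arrivalMs, taskStartMs, WRITE_TIMEOUT_MS);

        // Stamp taken after the payload is read: only queueing time counts.
        boolean dropAfter = shouldDrop(arrivalMs + readMs, taskStartMs, WRITE_TIMEOUT_MS);

        System.out.println("stamp before read -> drop=" + dropBefore);
        System.out.println("stamp after read  -> drop=" + dropAfter);
    }
}
```

Under these assumed numbers, the first check drops the message even though the stage is idle, while the second does not. Because a re-sent hint of the same size takes the same time to read, the first variant would drop it again on every retry.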

This message was sent by Atlassian JIRA
