cassandra-commits mailing list archives

From "Jason Brown (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-9237) Gossip messages subject to head of line blocking by other intra-cluster traffic
Date Wed, 22 Jul 2015 22:44:05 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-9237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637787#comment-14637787 ]

Jason Brown edited comment on CASSANDRA-9237 at 7/22/15 10:43 PM:
------------------------------------------------------------------

To be clear, let's understand what the real implication is here with the 'gossip' messages.
The purpose of sending the gossip messages in the current implementation is that it is the
primary vehicle for delivering updated heartbeat values of nodes in the cluster. The other
data that is passed in gossip (node metadata such as status, dc, rack, tokens, and so on)
changes very infrequently (or rarely), such that the eventual (or delayed!) delivery of that
data is reasonable. Heartbeats, however, are quite different. A continuous and nearly consistent
delivery time of updated heartbeats is critical for the stability of a cluster. You see, it
is through the receipt of the updated heartbeat that a node determines the reachability (UP/DOWN
status) of all peers in the cluster. The current implementation of FailureDetector measures
the time differences between the heartbeat updates received about a peer (Note: I said *about*
a peer, not *from* the peer directly, as those values are disseminated via gossip). Without
a consistent time delivery of those updates, the FD, via its use of the PHI-accrual algorithm,
will mark the peer as DOWN (unreachable). The two nodes could be sending all other traffic
without problem, but if the heartbeats are not propagated correctly, each of the nodes will
mark the other as DOWN, which is clearly suboptimal to cluster health. Note that heartbeat
updates are the only mechanism we use to determine reachability (UP/DOWN) of a peer; dynamic
snitch measurements, for example, are not included in the determination.  Hence, CASSANDRA-8789
could be quite a problem with regard to cluster stability, as raised by this ticket. (shame
on me for not raising any concerns earlier).
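
To make the failure mode concrete, here is a minimal sketch of a phi-accrual style detector.
It is not the actual FailureDetector code: the class name, the threshold of 8, the window size,
and the assumption that heartbeat gaps are exponentially distributed are all illustrative.

```python
import math
from collections import deque


class PhiAccrualSketch:
    """Toy phi-accrual detector (hypothetical; assumes exponential heartbeat gaps)."""

    def __init__(self, threshold=8.0, window=1000):
        self.threshold = threshold             # assumed conviction threshold
        self.intervals = deque(maxlen=window)  # recent gaps between heartbeat updates
        self.last_arrival = None

    def report(self, now):
        """Record that an updated heartbeat *about* the peer arrived at time `now`."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: -log10 of P(gap > elapsed) under the exponential model."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        return (elapsed / mean) / math.log(10)

    def is_down(self, now):
        return self.phi(now) > self.threshold


fd = PhiAccrualSketch()
for t in range(10):          # heartbeat updates arrive once per second
    fd.report(float(t))
on_time = fd.is_down(10.0)   # one second since the last update
delayed = fd.is_down(30.0)   # updates stuck behind other traffic for 21 seconds
```

With steady one-second updates, a 21-second stall pushes phi past the assumed threshold, so the
peer is convicted even though every other kind of traffic between the two nodes may be fine.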

Now, all this being said, we have a dilemma about what to do with regard to heartbeat dissemination.
I propose we drop the heartbeat concept altogether. The functionality we would lose immediately
is the ability to declare a peer node as UP or DOWN. To make up for that, the dynamic snitch
becomes much more intelligent and its measurements ultimately become responsible for determining
the reachability status (input to a revamped FD). As we already capture latencies in the dsnitch,
we can reasonably extend this to include timeouts/missed responses, and make that the basis
for the UP/DOWN decision. Not only will this be more efficient, as we will only need to
track the responses of peers that a node actually connects to, it will also lead
to more relevant decisions about the reachability of a peer.
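
As a rough sketch of what that could look like (nothing here exists in the dynamic snitch today;
the class, method names, window size, and thresholds are all assumptions for illustration), each
node could keep a sliding window of request outcomes per peer and derive UP/DOWN from the share
of timeouts:

```python
from collections import defaultdict, deque


class ResponseTracker:
    """Hypothetical per-peer tracker: UP/DOWN from observed responses and timeouts."""

    def __init__(self, window=20, max_timeout_ratio=0.5, min_samples=5):
        self.max_timeout_ratio = max_timeout_ratio  # assumed tolerance for timeouts
        self.min_samples = min_samples              # do not judge on too little data
        self.outcomes = defaultdict(lambda: deque(maxlen=window))

    def record_response(self, peer, latency_ms):
        self.outcomes[peer].append(latency_ms)

    def record_timeout(self, peer):
        self.outcomes[peer].append(None)  # None marks a timeout / missed response

    def is_up(self, peer):
        samples = self.outcomes.get(peer)
        if not samples or len(samples) < self.min_samples:
            return True  # no evidence against the peer yet
        timeouts = sum(1 for s in samples if s is None)
        return timeouts / len(samples) <= self.max_timeout_ratio


tracker = ResponseTracker()
for _ in range(5):
    tracker.record_response("B", 2.0)  # B answers requests normally
    tracker.record_timeout("C")        # every request to C times out
```

Here `tracker.is_up("B")` stays true while `tracker.is_up("C")` turns false, and a peer this node
never talks to is not tracked at all.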

To illustrate this last point, in the current implementation, assume a cluster of nodes: A,
B, and C. A partition starts between nodes A and C (no communication succeeds), but both nodes
can communicate with B. As B will get the updated heartbeats from both A and C, it will, via
gossip, send those over to the other node. Thus, A thinks C is UP, and C thinks A is UP. Unfortunately,
due to the partition between them, all communication between A and C will fail, yet neither
node will mark the other as down because each is receiving, transitively via B, the updated
heartbeat about the other. While it's true that the other node is alive, only having transitive
knowledge about a peer, and allowing that to be the sole determinant of UP/DOWN reachability
status, is not sufficient for a correct and efficiently operating cluster. Thus, if the dynamic
snitch had an expanded role, where its observations, based on actual in-use communication
paths, fed directly into the FailureDetector (for the UP/DOWN status), I think we would have
a better system, and one that more accurately reflects the state of the reachable cluster
that is available to each node.
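
The A/B/C scenario can be reproduced with a small simulation. This is a simplified model, not
the real gossip protocol: each round every node bumps its own heartbeat, then each working link
merges the two endpoints' views by taking the highest version seen. All names are illustrative.

```python
LINKS = [("A", "B"), ("B", "C")]  # working links; A <-> C is partitioned


def can_talk(x, y):
    """True if the two nodes have a working direct link."""
    return (x, y) in LINKS or (y, x) in LINKS


def simulate(rounds):
    nodes = ["A", "B", "C"]
    # view[n][p] = highest heartbeat version node n has seen about peer p
    view = {n: {p: 0 for p in nodes} for n in nodes}
    a_sees_c = []
    for _ in range(rounds):
        for n in nodes:
            view[n][n] += 1  # each node bumps its own heartbeat
        for x, y in LINKS:   # gossip exchange over each working link
            merged = {p: max(view[x][p], view[y][p]) for p in nodes}
            view[x] = dict(merged)
            view[y] = dict(merged)
        a_sees_c.append(view["A"]["C"])
    return view, a_sees_c


view, a_sees_c = simulate(5)
```

In this model A's view of C's heartbeat keeps advancing every round (relayed by B), so a
heartbeat-driven detector on A keeps C marked UP, even though `can_talk("A", "C")` is false.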

I expect there would be some subtleties and complications with this idea, but I feel those
are surmountable implementation details.

Going back to this ticket, then, I am convinced that if we eliminate the time-sensitive delivery
of the heartbeats (which drives our current notions of peer availability), then we don't need
to be overly concerned about the HoL issues raised here.



> Gossip messages subject to head of line blocking by other intra-cluster traffic
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-9237
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9237
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>
> Reported as an issue over less than perfect networks like VPNs between data centers.
> Gossip goes over the small message socket where small is 64k which isn't particularly
> small. This is done for performance to keep most traffic on one hot socket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
