cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Brown (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10244) Replace heartbeats with locally recorded metrics for failure detection
Date Wed, 02 Sep 2015 04:44:45 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726735#comment-14726735
] 

Jason Brown commented on CASSANDRA-10244:
-----------------------------------------

bq. If two machines A and B don't communicate a lot, are they going to detect that other one
is down on the next read/write?

A node will detect the peer is *unreachable* (the peer may be down, there may be a partition,
and so on) when a response is not received within the timeout window on the first set of requests
to the peer. At that point, A can choose to mark B down after some amount of failures have
occurred; of course, we'll need some alg to determine when to try the peer again. So, yes,
A will need to send requests to B to determine if it is "available", and not rely solely upon
transitive availability via heartbeat propagation.

bq. If any machine detects that a machine is down, will be propagate that information? 

No, because if node A cannot contact node B, how should that affect node C (who is current
having successful communication with B)? Furthermore, how would A inform the rest of the cluster
about it's unavailability to reach B? Gossip? By the time the message from A percolated throughout
the cluster, the partition between A and B could be healed, thus making the "I think node
B is DOWN" no longer correct/relevant, and potentially misleading other nodes in the cluster.
We should avoid being too clever and simply allow each node to make it's own local decision
about the availability of each of the other nodes in the cluster.

bq. We should not confuse the FD if someone is writing a large batch which is slow

I'm not quite understanding how that pertains here, but let me try to respond. If anything,
the inputs to the FD would be something like the response times/rates from peers to which
we have sent requests. If we fail to get responses, that is a red flag (the peer is being
unavailable); but slower responses would just cause that peer to be pushed down in the list
of "fastest peers" via the dsnitch. (If I haven't answered your question, can you please restate
it?)

> Replace heartbeats with locally recorded metrics for failure detection
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-10244
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10244
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jason Brown
>            Assignee: Jason Brown
>
> In the current implementation, the primary purpose of sending gossip messages is for
delivering the updated heartbeat values of each node in a cluster. The other data that is
passed in gossip (node metadata such as status, dc, rack, tokens, and so on) changes very
infrequently (or rarely), such that the eventual delivery of that data is entirely reasonable.
Heartbeats, however, are quite different. A continuous and nearly consistent delivery time
of updated heartbeats is critical for the stability of a cluster. It is through the receipt
of the updated heartbeat that a node determines the reachability (UP/DOWN status) of all peers
in the cluster. The current implementation of FailureDetector measures the time differences
between the heartbeat updates received about a peer (Note: I said about a peer, not from the
peer directly, as those values are disseminated via gossip). Without a consistent time delivery
of those updates, the FD, via it's use of the PHI-accrual algorigthm, will mark the peer as
DOWN (unreachable). The two nodes could be sending all other traffic without problem, but
if the heartbeats are not propagated correctly, each of the nodes will mark the other as DOWN,
which is clearly suboptimal to cluster health. Further, heartbeat updates are the only mechanism
we use to determine reachability (UP/DOWN) of a peer; dynamic snitch measurements, for example,
are not included in the determination. 
> To illustrate this, in the current implementation, assume a cluster of nodes: A, B, and
C. A partition starts between nodes A and C (no communication succeeds), but both nodes can
communicate with B. As B will get the updated heartbeats from both A and C, it will, via gossip,
send those over to the other node. Thus, A thinks C is UP, and C thinks A is UP. Unfortunately,
due to the partition between them, all communication between A and C will fail, yet neither
node will mark the other as down because each is receiving, transitively via B, the updated
heartbeat about the other. While it's true that the other node is alive, only having transitive
knowledge about a peer, and allowing that to be the sole determinant of UP/DOWN reachability
status, is not sufficient for a correct and effieicently operating cluster. 
> This transitive availability is suboptimal, and I propose we drop the heartbeat concept
altogether. Instead, the dynamic snitch should become more intelligent, and it's measurements
ultimately become the input for determining the reachability status of each peer(as fed into
a revamped FD). As we already capture latencies in the dsntich, we can reasonably extend it
to include timeouts/missed responses, and make that the basis for the UP/DOWN decisioning.
Thus we will have more accurate and relevant peer statueses that is tailored to the local
node.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message