cassandra-commits mailing list archives

From "Peter Schuller (Commented) (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-3569) Failure detector downs should not break streams
Date Thu, 08 Dec 2011 05:49:40 GMT


Peter Schuller commented on CASSANDRA-3569:

So yes, I *am* saying that the actual TCP connection is a more reliable detector, in the sense
that false negatives are by definition impossible, than the failure detector, since the only
thing the streaming actually depends on is the data being transferred over that TCP connection.

The idea that the failure detector might somehow be better than a keep-alive is IMO not
worth the negative side-effects of bogusly killing running repairs.

I *do* however agree that since we cannot choose the TCP keep-alive behavior on a per-socket
basis, there is in fact some actual functionality in the failure detector that we don't get
in keep-alive, so I will grant that the failure detector is not strictly <= the TCP connection.

When I say that the FD is orthogonal, I am specifically referring to the fact that our FD
is essentially doing messaging *on top* of a TCP connection. If we had been UDP based, it
would make sense to let the FD be the keeper of logic having to do with the connectivity we
have to other nodes. But given that we are using TCP, it's not adding anything, it seems to
me, except for the fact that the FD is more tweakable (from the app) than the keep-alive timeouts.
Or, put another way, the FD is communicating over a channel which by design is never supposed
to "block"; we could effectively use a TCP timeout there, which we cannot for streaming because
we don't want to cause streaming TCP connections to die just because e.g. the receiver is
legitimately blocking for an extended period of time (this is why keep-alive is more suitable
since it applies to the health of the underlying transport and is independent of whether in-band
data is being sent).
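
To make the per-socket limitation concrete, here is a minimal sketch (not a patch; the peer
address and port are placeholders) of what enabling keep-alive on a streaming socket looks like
from Java. The standard java.net API only exposes the on/off switch; the probe timing still
comes from OS-wide settings:

    import java.io.IOException;
    import java.net.InetAddress;
    import java.net.Socket;

    public class KeepAliveSketch
    {
        public static void main(String[] args) throws IOException
        {
            // Placeholder peer; 7000 is the default storage port.
            Socket streamSocket = new Socket(InetAddress.getByName("10.0.0.2"), 7000);

            // Keep-alive probes the health of the underlying transport
            // regardless of whether any in-band data is flowing, which is
            // what makes it suitable for long-lived streaming connections.
            streamSocket.setKeepAlive(true);

            // The probe timing (net.ipv4.tcp_keepalive_{time,intvl,probes} on
            // Linux) is not settable through this API; that is the "not
            // tweakable from the app" limitation mentioned above.
            streamSocket.close();
        }
    }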

I'm not claiming that at all. I was really only pointing out that the main goal of detecting
failure in the first place was to address frustrated and confused users of hanging repairs,
just to be sure we were on the same page.

I completely agree with the goal (after all I *am* a user, and a user with a long history
of pain with repair), just not the method used to achieve it, because the cost of slaying
perfectly working repair sessions is just too high. Given that, I prefer keep-alive because
it *does* do the job, except for the problem of non-tweakable time frames, which I still consider acceptable.
A two-hour wait in the unusual case where a connection silently dies in a way that does not
generate an RST or otherwise close the connection does not feel like the end of the world
to me. It is absolutely not optimal though, I agree with that.
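
For reference (stock Linux defaults; distributions may differ), the arithmetic behind that
"two hour" figure works out roughly as follows:

    tcp_keepalive_time   = 7200 s  (idle time before the first probe)
    tcp_keepalive_intvl  =   75 s  (interval between probes)
    tcp_keepalive_probes =    9    (unanswered probes before the socket is declared dead)

    7200 + 9 * 75 = 7875 s, i.e. about 2 hours 11 minutes from the last data
    until the connection is torn down.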

Quoting out of order:

 And I'm personally fine having a 'different and wider discussion' if that helps improving
the code (btw, if our way of doing failure detection is really fundamentally broken in many
ways, it shouldn't be too hard to show how).

The short version is that the failure detector seems to be designed for unreliable datagram
messaging. Given that we use TCP, the connection itself already does what we need to determine
"are we able to communicate with this guy at all?". For everything else we would need good
handling of, like half-down failure modes, the FD is mostly useless to us anyway (the best
thing we have is the dynamic snitch, but that also has issues). It seems to me the FD is more
of a show-off of an algorithm because the paper looked cool than something actually addressing
a real-world problem for Cassandra (I'm sorry, but that really is what it seems like to me).

Witness CASSANDRA-3294 for a good example of this reality vs. theory split-brain problem.
Why on earth are we waiting on the FD to figure out that we shouldn't send messages to a node,
when we have *absolute perfect proof* that the *only* communication channel we have with the
other node is *not working*? Having 1/N requests (in an N node cluster) suddenly stall for
rpc_timeout is not a sensible behavior for high-request rate production systems, when the
failure mode is one of the most friendly modes you can imagine (clean process death, OS knows
it, TCP connection dies, other end gets an RST).
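
To illustrate what I mean (this is a hypothetical sketch, not actual Cassandra internals; all
names are made up): if we track the state of the outbound connection ourselves, we can fail
such requests immediately instead of letting them sit until rpc_timeout:

    import java.net.InetAddress;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class FailFastSketch
    {
        // Whether the outbound connection to each peer is currently established.
        private final Map<InetAddress, Boolean> connectionUp =
                new ConcurrentHashMap<InetAddress, Boolean>();

        public void onConnectionClosed(InetAddress peer)
        {
            // e.g. the socket got an RST because the remote process died cleanly
            connectionUp.put(peer, Boolean.FALSE);
        }

        public void onConnectionEstablished(InetAddress peer)
        {
            connectionUp.put(peer, Boolean.TRUE);
        }

        public void send(byte[] message, InetAddress peer)
        {
            if (!Boolean.TRUE.equals(connectionUp.get(peer)))
            {
                // We have direct proof the transport is down; surface the
                // failure now rather than queueing the message and waiting
                // for rpc_timeout to expire.
                throw new IllegalStateException("no live connection to " + peer);
            }
            // ... hand the message to the live connection here ...
        }
    }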

It is a valid point that a TCP connection hiccup is probably not a good reason to consider
a node down in the sense that we shouldn't be trying to e.g. stream from it during bootstrap
etc. It makes sense to filter out flapping there, maybe. But "sending" messages
to a node with whom we currently cannot communicate, by definition, seems obviously bad. And
it is a real problem in production, although mitigated once you realize this (undocumented)
limitation and start doing the disablegossip+disablethrift+wait-for-a-while dance.

(If there is some kind of network flap so that e.g. all replicas for a row are all temporarily
unreachable, it's also perfectly valid to fast-fail and return an error back to the client,
rather than queueing up messages to be delivered later on or something like that.)

So essentially, my feeling is that the FD is trying to solve a very general problem which is only
partially overlapping with the real-world concerns of Cassandra, while failing to address
the simplest cases that are also very common.

Anyways, this is really something which could get discussed in another ticket of its own.
But, in my opinion there are far more important things to change than the workings of the
failure detector.

> Failure detector downs should not break streams
> -----------------------------------------------
>                 Key: CASSANDRA-3569
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
> CASSANDRA-2433 introduced this behavior just to keep repairs from sitting there waiting
forever. In my opinion the correct fix to that problem is to use TCP keep-alive. Unfortunately
the TCP keep-alive period is insanely high by default on modern Linux, so just doing that
is not entirely good either.
> But using the failure detector seems nonsensical to me. We have a communication channel,
the TCP transport, which we know is used for long-running processes that we don't want
killed incorrectly for no good reason, and yet we are using a failure detector, tuned to
detecting when not to send real-time-sensitive requests to nodes, in order to actively kill
a working connection.
> So, rather than add complexity with protocol-based ping/pongs and such, I propose that
we simply use TCP keep-alive for streaming connections and instruct operators of production
clusters to tweak net.ipv4.tcp_keepalive_{probes,intvl} as appropriate (or whatever equivalent
on their OS).
> I can submit the patch. Awaiting opinions.

This message is automatically generated by JIRA.

