cassandra-commits mailing list archives

From "Brandon Williams (JIRA)" <>
Subject [jira] [Updated] (CASSANDRA-4375) FD incorrectly using RPC timeout to ignore gossip heartbeats
Date Wed, 08 Jan 2014 15:41:53 GMT


Brandon Williams updated CASSANDRA-4375:

    Attachment: 4375.txt

I've thought about this a bit, and still think it should default to ring delay, but not be
coupled to it.  However, in normal operation, I do think it makes sense to couple the initial
value we seed the FD with and the max interval we accept.  I don't think most people should
be tweaking these though, so I've made them system properties (as ring delay is.)

Patch adds cassandra.fd_initial_value_ms to control the value the FD is seeded with, which
the max interval will also default to, but also adds cassandra.fd_max_interval_ms if you really
need them to be disjoint (most likely for testing like CASSANDRA-6558 where you want the seed
ridiculously low, but the max interval reasonable.)
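
As a rough illustration of the defaulting described above (not the contents of 4375.txt), the lookup could be sketched as follows. The two property names come from this comment; the class name, the helper methods, and the 30000 ms fallback used in main are assumptions made for the example only.

```java
import java.util.Properties;

// Hypothetical sketch: resolves the two FD settings named in the comment.
// Nothing here is taken from the actual patch.
public final class FdSettingsSketch {
    static final String INITIAL_VALUE = "cassandra.fd_initial_value_ms";
    static final String MAX_INTERVAL = "cassandra.fd_max_interval_ms";

    // Reads an int property, falling back when the property is not set.
    static int intProperty(Properties props, String key, int fallback) {
        String raw = props.getProperty(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    // Value the FD is seeded with; defaultMs stands in for the ring delay default.
    static int initialValueMs(Properties props, int defaultMs) {
        return intProperty(props, INITIAL_VALUE, defaultMs);
    }

    // Max accepted interval; follows the initial value unless set explicitly.
    static int maxIntervalMs(Properties props, int defaultMs) {
        return intProperty(props, MAX_INTERVAL, initialValueMs(props, defaultMs));
    }

    public static void main(String[] args) {
        int assumedDefaultMs = 30_000; // assumption: placeholder for the ring delay default
        Properties props = System.getProperties();
        System.out.println("initial=" + initialValueMs(props, assumedDefaultMs)
                + " max=" + maxIntervalMs(props, assumedDefaultMs));
    }
}
```

Under this sketch, running with only -Dcassandra.fd_initial_value_ms=100 would lower both values, while also passing -Dcassandra.fd_max_interval_ms=2000 would make them disjoint, which is the testing case mentioned above.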

I will note that I changed the max interval from a double to an int, because a double just
didn't make any sense.

> FD incorrectly using RPC timeout to ignore gossip heartbeats
> ------------------------------------------------------------
>                 Key: CASSANDRA-4375
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Peter Schuller
>            Assignee: Brandon Williams
>              Labels: gossip
>             Fix For: 1.2.14
>         Attachments: 4375.txt
> Short version: You can't run a cluster with short RPC timeouts because nodes just constantly flap up/down.
> Long version:
> CASSANDRA-3273 tried to fix a problem resulting from the way the failure detector works, but did so by introducing a much more severe bug: with RPC timeouts that are lower than the typical gossip propagation time, a cluster will just constantly have all nodes flapping other nodes up and down.
> The cause is this:
> {code}
> +    // in the event of a long partition, never record an interval longer than the rpc
> +    // since if a host is regularly experiencing connectivity problems lasting this long we'd
> +    // rather mark it down quickly instead of adapting
> +    private final double MAX_INTERVAL_IN_MS = DatabaseDescriptor.getRpcTimeout();
> {code}
> And then:
> {code}
> -        tLast_ = value;            
> -        arrivalIntervals_.add(interArrivalTime);        
> +        if (interArrivalTime <= MAX_INTERVAL_IN_MS)
> +            arrivalIntervals_.add(interArrivalTime);
> +        else
> +            logger_.debug("Ignoring interval time of {}", interArrivalTime);
> {code}
> Using the RPC timeout to ignore unreasonably long intervals is not correct, as the RPC timeout is completely orthogonal to gossip propagation delay (see CASSANDRA-3927 for a quick description of how the FD works).
> In practice, the propagation delay ends up being in the 0-3 second range on a cluster with good local latency. With a low RPC timeout of, say, 200 ms, very few heartbeat updates come in fast enough to avoid being ignored by the failure detector. This in turn means that the FD records a completely skewed average heartbeat interval, which in turn means that nodes almost always get flapped on interpret() unless they happen to *just* have had their heartbeat updated. Then they flap back up whenever the next heartbeat comes in (since a node gets brought up immediately).
> In our build, we are replacing the FD with an implementation that simply uses a fixed {{N}}-second time to convict, because this is just one of many ways in which the current FD hurts, while we still haven't found a way it actually helps relative to the trivial fixed-second conviction policy.
> For upstream, assuming people won't agree on changing it to a fixed timeout, I suggest, at minimum, never using a value lower than something like 10 seconds when determining whether to ignore an interval. Slightly better is to make it a config option.
> (I should note that if propagation delays are significantly off from the expected level, things other than the FD already break, such as the whole concept of {{RING_DELAY}}, which assumes the propagation time is roughly constant with e.g. cluster size.)
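
To make the skew described in the report concrete, here is a minimal sketch of the interval filter in isolation. It is a simplification and not the actual failure detector: the class and method names are hypothetical, the sample intervals are made up for illustration, and the real FD's windowing and conviction logic are not modelled. It only shows how a cutoff below the typical gossip delay changes the recorded mean.

```java
import java.util.Arrays;
import java.util.OptionalDouble;

// Hypothetical sketch of the "ignore intervals above a cutoff" rule from the
// quoted diff, reduced to a pure function over made-up sample data.
public final class IntervalFilterSketch {
    // Mean of the intervals that pass the cutoff; empty if all were ignored.
    static OptionalDouble meanAccepted(long[] intervalsMs, long maxIntervalMs) {
        return Arrays.stream(intervalsMs)
                .filter(interval -> interval <= maxIntervalMs)
                .average();
    }

    public static void main(String[] args) {
        // Illustrative heartbeat inter-arrival times in ms, within the 0-3 s
        // range the report mentions. These values are invented.
        long[] observed = {100, 1500, 2500, 150, 3000};

        // Cutoff tied to a 200 ms RPC timeout: only the two fastest survive.
        System.out.println(meanAccepted(observed, 200));    // OptionalDouble[125.0]

        // Cutoff of 10 s, as the report suggests at minimum: all are kept.
        System.out.println(meanAccepted(observed, 10_000)); // OptionalDouble[1450.0]
    }
}
```

With the 200 ms cutoff the recorded mean is 125 ms even though most heartbeats in the sample arrive seconds apart, so a detector expecting heartbeats every 125 ms or so would see nearly every real gap as suspicious.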

This message was sent by Atlassian JIRA
