cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joseph Lynch (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-14001) Gossip after node restart can take a long time to converge about "down" nodes in large clusters
Date Wed, 08 Nov 2017 01:22:00 GMT


Joseph Lynch commented on CASSANDRA-14001:

I think CASSANDRA-13993 might help with this, but I _thin_  it's solving a slightly different

> Gossip after node restart can take a long time to converge about "down" nodes in large
> -----------------------------------------------------------------------------------------------
>                 Key: CASSANDRA-14001
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Lifecycle
>            Reporter: Joseph Lynch
>            Priority: Minor
> When nodes restart in a large cluster, they mark all nodes as "alive", which first calls
{{markDead}} and then creates an {{EchoMessage}} and in the callback to that marks the node
as alive. This works great, except when that initial echo fails for w.e. reason and that node
is marked as dead, in which case it will remain dead for a long while.
> We mostly see this on 100+ node clusters, and almost always when nodes are in different
datacenters that have unreliable network connections (e.g, cross region in AWS) and I think
that it comes down to a combination of:
> 1. Only a node itself can mark another node as "UP"
> 2. Nodes only gossip with dead nodes with probability {{#dead / (#live +1)}}
> In particular the algorithm in #2 leads to long convergence times because the number
of dead nodes it typically very small compared to the cluster size. My back of the envelope
model of this algorithm indicates that for a 100 node cluster this would take an average of
~50 seconds with a stdev of 50 seconds, which means we might be waiting _minutes_ for the
nodes to gossip with each other. I'm modeling this as the minimum of two [geometric distributions|]
with parameter {{p=1/#nodes}}, yielding a geometric distribution with parameter {{p=1-(1-(1/#nodes)^2)}}.
So for a 100 node cluster:
> {noformat}
> 100 node cluster =>
> X = Pr(node1 gossips with node2) = geom(0.01)
> Y = Pr(node 2 gossips with node1) = geom(0.01)
> Z = min(X or Y) = geom(1 - (1 - 0.01)^2) = geom(0.02)
> E[Z] = 1/0.02 = 50
> V[Z] = (1-0.02)/(0.02)^2 = 2450
> 1000 node cluster ->
> Z = geom(1 - (1 - 0.001)^2) = geom(0.002)
> E[Z] = 500
> V[Z] = 24500
> {noformat}
> Since we gossip every second that means that on expectation in a 100 node cluster these
nodes would see each other after about a minute and in a thousand node cluster, after ~8 minutes.
For 100 node clusters the variance is astounding, and means that in particular edge cases
we might be waiting hours before these nodes gossip with each other.
> I'm thinking of writing a patch which either:
> # Makes gossip order a shuffled list that includes dead nodes a la [swim gossip|].
This would make it so that we waste some rounds on dead nodes but guarantee linear bounding
of gossip.
> # Adds an endpoint that re-triggers gossip with all nodes. Operators could call this
after a restart a few times if they detect a gossip inconsistency.
> # Bounding the probability we gossip with a dead node at some reasonable number like
1/10 or something. This might cause a lot of gossip load when a node is actually down for
large clusters, but would also act to bound the variance.
> # Something else?
> I've got a WIP [branch|]
on 3.11 which implements options #1 and #2, but I can reduce/change/modify as needed if people
think there is a better way. The patch doesn't pass tests yet but I'm not going to change/add
the tests unless we think moving to time bounded gossip for down nodes is a good idea.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message