cassandra-commits mailing list archives

From "Jason Brown (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9206) Remove seed gossip probability
Date Fri, 17 Apr 2015 20:01:59 GMT


Jason Brown commented on CASSANDRA-9206:

TBH, I'm kinda +0 on this ticket. While I agree the original motivation behind the probabilistic
decision to contact seeds is a bit spurious/funky/undocumented, I'm not completely convinced
adding more traffic will help much in cluster convergence. For small clusters (fewer than 20
nodes), there will be near zero impact, so I don't have much problem in that case - but then,
they probably don't suffer from the problems we're trying to address here.

However, for larger clusters (greater than 500 nodes), I think the extra messaging might be
an issue. The problem I see is that when things slow down, and you have a very low number
of seed nodes (i.e. fewer than 5), the gossip messages will back up on those nodes and we'll
spend a lot of cycles just trying to broadcast the same redundant data over and over again.
What's worse is that the operator won't really have any great way to discover that gossip
(our membership dissemination protocol) is contributing to things going weird; and, thus,
the advice to "add more seeds" is neither obvious nor simple, in some cases. (I'm thinking of Netflix's
Priam, which is programmed to use up to two nodes per availability zone as seeds. It would require a
non-trivial effort to change that core assumption, fwiw.) Further, in 3.0, we've now split
the OTCP by message size, not function. Thus, all the excess gossip messages on the seeds
could start interfering with the normal read/write traffic.

Also, we will not create a spanning tree by increasing the number of nodes contacted during
a gossip round. What that does is increase the fanout (the number of nodes contacted) from
a fixed size of 1 to 2. We still have randomly selected peers at every step, and neither a static
nor a dynamic tree that covers all nodes from a given sender.
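
To illustrate the fanout point, here is a rough sketch of how one node might pick its targets
in a single gossip round. The class, method, and parameter names are hypothetical and this is
not the actual Gossiper code; it only assumes one random live peer per round, plus one random
seed when the patch is applied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not Cassandra's Gossiper: one node choosing its
// gossip targets for a single round.
public class GossipFanout {
    // Returns the peers contacted this round: one random live peer, plus one
    // random seed when alwaysGossipToSeed is set (fanout of 1 vs. 2).
    // The two picks are independent, so they may name the same node.
    static List<String> pickTargets(List<String> livePeers, List<String> seeds,
                                    boolean alwaysGossipToSeed, Random random) {
        List<String> targets = new ArrayList<>();
        if (!livePeers.isEmpty())
            targets.add(livePeers.get(random.nextInt(livePeers.size())));
        if (alwaysGossipToSeed && !seeds.isEmpty())
            targets.add(seeds.get(random.nextInt(seeds.size())));
        return targets;
    }
}
```

Both picks are re-drawn every round, so no edge is fixed from one round to the next; that is a
larger fanout, not a tree.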

Lastly, there is a minor error in the ticket's estimate of the number of messages to be generated: in a cluster of
1000 nodes, we will start 1000 more gossip sessions to the seeds, and each gossip session
consists of 3 messages. Thus, the message count is 3000, not ~1000. If you are actually running a
cluster that large, and the network can't sustain that extra load, you're probably screwed.
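
To make the arithmetic concrete, a back-of-the-envelope sketch. The class and method names are
hypothetical (not from the codebase), the three messages per session are the SYN, ACK, and ACK2
of a gossip exchange, and the per-seed figure assumes sessions spread evenly across seeds,
which random selection only approximates.

```java
// Hypothetical helper, not part of Cassandra: estimates the extra gossip
// messages generated per round when every node always contacts a seed.
public class SeedGossipLoad {
    // A gossip session is three messages: SYN, ACK, ACK2.
    static final int MESSAGES_PER_SESSION = 3;

    // One extra session per node, three messages per session.
    static long extraMessagesPerRound(int nodes) {
        return (long) nodes * MESSAGES_PER_SESSION;
    }

    // Assumes the extra sessions are spread evenly across the seeds.
    static double messagesPerSeed(int nodes, int seeds) {
        return extraMessagesPerRound(nodes) / (double) seeds;
    }

    public static void main(String[] args) {
        System.out.println(extraMessagesPerRound(1000)); // 3000
        System.out.println(messagesPerSeed(1000, 2));    // 1500.0
    }
}
```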

While this might help in convergence (primarily for heartbeat dissemination), the trade-off
is more (non-directed) traffic. All in all (and thinking while I'm typing), this patch
is probably fine for the vast majority of use cases, and if anything, the clarity in the code
that will come from it should be worthwhile.

> Remove seed gossip probability
> ------------------------------
>                 Key: CASSANDRA-9206
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>             Fix For: 2.1.5
>         Attachments: 9206.txt
> Currently, we use probability to determine whether a node will gossip with a seed:
> {noformat} 
>                 double probability = seeds.size() / (double) (liveEndpoints.size() +
>                 double randDbl = random.nextDouble();
>                 if (randDbl <= probability)
>                     sendGossip(prod, seeds);
> {noformat}
> I propose that we remove this probability, and instead *always* gossip with a seed.
> This of course means increased traffic and processing on the seed(s), but even a 1000 node
> cluster with a single seed will only put ~1000 messages per second on the seed, which is virtually
> nothing. Should it become a problem, the solution is simple: add more seeds. Since seeds
> will also always gossip with each other, this effectively gives us a poor man's spanning tree,
> with the only cost being removing a few lines of code, and should greatly improve our gossip
> convergence time, especially in large clusters.
