cassandra-user mailing list archives

From: Kyrylo Lebediev
Subject: Re: vnodes: high availability
Date: Tue, 16 Jan 2018 21:54:07 GMT
Thank you for this valuable info, Jon.
I guess both you and Alex are referring to the improved vnode allocation method
which was implemented in 3.0.

Based on your info and the comments in the ticket, it's really a bad idea to use a small number
of vnodes on versions that use the old allocation method, because of hot spots, so it's not an
option for my particular case (v2.1) :(

[As far as I can see from the source code, this new method wasn't backported to 2.1.]



[CASSANDRA-7032] Improve vnode allocation - ASF JIRA
It's been known for a little while that random vnode allocation causes hotspots of ownership.
It should be possible to improve dramatically on this with deterministic ...

From: Jon Haddad
Sent: Tuesday, January 16, 2018 8:21:33 PM
Subject: Re: vnodes: high availability

We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the randomness.  There’s
going to be some imbalance; how much depends on luck, unfortunately.

I’m interested to hear your results using 4 tokens, would you mind letting the ML know your
experience when you’ve done it?


On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev wrote:

Agree with you, Jon.
Actually, this cluster was configured by my 'predecessor' and [fortunately for him] we've
never met :)
We're using version 2.1.15 and can't upgrade because of legacy Netflix Astyanax client used.

Below in the thread Alex mentioned that setting vnodes to a value lower than
256 is recommended only for C* version > 3.0 (the token allocation algorithm was improved in C* 3.0).

Do you have positive experience setting up a cluster with vnodes < 256 on C* 2.1?

vnodes=32 is also too high, in my opinion (we would need many more than 32 servers per AZ in order
to get a 'reliable' cluster).
vnodes=4 seems to be a better HA + balancing trade-off.

From: Jon Haddad
Sent: Tuesday, January 16, 2018 6:44:53 PM
To: user
Subject: Re: vnodes: high availability

While all the token math is helpful, I have to also call out the elephant in the room:

You have not correctly configured Cassandra for production.

If you had used the correct endpoint snitch & network topology strategy, you would be
able to withstand the complete failure of an entire availability zone at QUORUM, or two if
you queried at CL=ONE.

You are correct about 256 tokens causing issues, it’s one of the reasons why we recommend
32.  I’m curious how things behave going as low as 4, personally, but I haven’t done the
math / tested it yet.

On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev wrote:

To me it sounds like 'C* isn't as highly available by design as it's declared to be'.
More nodes in a cluster means a higher probability of simultaneous node failures.
And from a high-availability standpoint, it looks like the situation is made even worse by the recommended
setting vnodes=256.

I need to do some math to get numbers/formulas, but for now the situation doesn't look promising.
In case somebody from the C* developers/architects is reading this message, I'd be grateful for
some links to the calculations of C* reliability on which these decisions were based.

From: kurt greaves
Sent: Tuesday, January 16, 2018 2:16:34 AM
To: User
Subject: Re: vnodes: high availability

Yeah it's very unlikely that you will have 2 nodes in the cluster with NO intersecting token
ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down all 256 ranges will go down, and considering there are only 49 other nodes
all with 256 vnodes each, it's very likely that every node will be responsible for some range
A was also responsible for. I'm not sure what the exact math is, but think of it this way:
On each node, if any of its 256 token ranges overlaps on the ring with a token range on node A
(that is, it falls within the next RF-1 or previous RF-1 token ranges), those token ranges will be down.

Because token range assignment just uses rand() under the hood, I'm sure you could prove that
it's always going to be the case that any 2 nodes going down result in a loss of QUORUM for
some token range.
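
The claim above can be checked empirically rather than proved. The following is a minimal sketch, assuming purely random token placement and SimpleStrategy-style replication (each range is replicated to the next RF distinct nodes clockwise on the ring). The function names and the use of floats in [0, 1) as tokens are illustrative assumptions, not Cassandra's actual code.

```python
import random
from itertools import combinations


def replica_groups(num_nodes, vnodes, rf=3, seed=0):
    """Return the distinct replica groups (frozensets of node ids) on a random ring."""
    if num_nodes < rf:
        raise ValueError("need at least rf nodes")
    rng = random.Random(seed)
    # Every node picks `vnodes` random tokens; the ring is the tokens in sorted order.
    ring = sorted((rng.random(), node)
                  for node in range(num_nodes) for _ in range(vnodes))
    owners = [node for _, node in ring]
    groups = set()
    for start in range(len(owners)):
        # Walk clockwise from this token and collect the next rf distinct nodes.
        replicas = []
        pos = start
        while len(replicas) < rf:
            node = owners[pos % len(owners)]
            if node not in replicas:
                replicas.append(node)
            pos += 1
        groups.add(frozenset(replicas))
    return groups


def unsafe_pair_fraction(num_nodes, vnodes, rf=3, seed=0):
    """Fraction of node pairs that share at least one replica group.

    With RF=3, losing both nodes of such a pair leaves one replica for some
    range, which is below QUORUM.
    """
    unsafe = set()
    for group in replica_groups(num_nodes, vnodes, rf, seed):
        unsafe.update(combinations(sorted(group), 2))
    total_pairs = num_nodes * (num_nodes - 1) // 2
    return len(unsafe) / total_pairs


if __name__ == "__main__":
    for vnodes in (4, 32, 256):
        print(vnodes, unsafe_pair_fraction(50, vnodes))
```

Running it for 50 nodes with different vnode counts shows how the share of "unsafe" pairs grows with the number of vnodes per node under these assumptions.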

On 15 January 2018 at 19:59, Kyrylo Lebediev wrote:
Thanks Alexander!

I don't have an MS in math either, unfortunately :)

Not sure, but it seems to me that the probability of 2/49 in your explanation doesn't take into
account that vnode endpoints are almost evenly distributed across all nodes (at least that's
what I can see from "nodetool ring" output).
Of course this vnodes illustration is a theoretical one, but there are no 2 nodes on that diagram
that can be switched off without losing a key range (at CL=QUORUM).

That's because vnodes_per_node=8 > Nnodes=6.
As far as I understand, the situation gets worse as vnodes_per_node/Nnodes increases.
Please correct me if I'm wrong.

How would the situation differ from this example by DataStax if we had a real-life 6-node
cluster with 8 vnodes on each node?


From: Alexander Dejanovski
Sent: Monday, January 15, 2018 8:14:21 PM

Subject: Re: vnodes: high availability

I was corrected off list that the odds of losing data when 2 nodes are down isn't dependent
on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice versa.
I officially suck at statistics, as expected :)

On Mon, 15 Jan 2018 at 17:55, Alexander Dejanovski wrote:
Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram, which is fairly theoretical.
If you're using SimpleStrategy, there is no rack awareness. Since vnode distribution is purely
random, and the replica for a vnode will be placed on the node that owns the next vnode in
token order (yeah, that's not easy to formulate), you end up with statistics only.

I kinda suck at maths but I'm going to risk making a fool of myself :)

The odds for one vnode to be replicated on another node are, in your case, 2/49 (out of 49
remaining nodes, 2 replicas need to be placed).
Given you have 256 vnodes, the odds for at least one vnode of a single node to exist on another
one is 256*(2/49) = 10.4%
Since the relationship is bi-directional (there are the same odds for node B to have a vnode
replicated on node A as the opposite), that doubles the odds of 2 nodes both being replicas
for at least one vnode: 20.8%.

Having a smaller number of vnodes will decrease the odds, just as having more nodes in the
cluster will.
(now once again, I hope my maths aren't fully wrong, I'm pretty rusty in that area...)
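
As a quick check on the arithmetic above: 256 * (2/49) is about 10.45, which reads more naturally as an expected number of shared vnodes than as a percentage. The sketch below computes both that expected count and the probability of at least one shared vnode, under the simplifying assumption that each vnode's replicas land independently and uniformly on the other nodes. The function name is hypothetical and the independence assumption is only an approximation of how replicas are really placed on the ring.

```python
def shared_range_odds(num_nodes, vnodes, rf):
    """Return (expected shared vnodes, P(at least one shared vnode)) for one pair of nodes.

    Assumes every vnode's rf - 1 extra replicas are placed independently and
    uniformly on the other num_nodes - 1 nodes.
    """
    # Chance that one particular vnode of node A is also replicated on node B.
    p_single = (rf - 1) / (num_nodes - 1)
    # Expected number of node A's vnodes that are replicated on node B.
    expected_shared = vnodes * p_single
    # Complement of "none of the vnodes is shared".
    p_at_least_one = 1 - (1 - p_single) ** vnodes
    return expected_shared, p_at_least_one


if __name__ == "__main__":
    for vnodes in (4, 32, 256):
        expected, prob = shared_range_odds(num_nodes=50, vnodes=vnodes, rf=3)
        print(f"vnodes={vnodes}: expected shared={expected:.2f}, P(>=1 shared)={prob:.5f}")
```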

How many queries that will affect is a different question, as it depends on which partitions
currently exist and are queried in the unavailable token ranges.

Then you have rack awareness that comes with NetworkTopologyStrategy :
If the number of replicas (3 in your case) is proportional to the number of racks, Cassandra
will spread replicas in different ones.
In that situation, you can theoretically lose as many nodes as you want in a single rack,
you will still have two other replicas available to satisfy quorum in the remaining racks.
If you start losing nodes in different racks, we're back to doing maths (but the odds will
get slightly different).

That makes maintenance predictable because you can shut down as many nodes as you want in
a single rack without losing QUORUM.
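
The rack argument above can be illustrated with a small simulation. This is a simplified sketch, assuming RF equals the number of racks and that placement walks the ring clockwise and skips nodes whose rack already holds a replica. The real NetworkTopologyStrategy handles more cases than this, and all names and defaults here are hypothetical.

```python
import random


def rack_aware_groups(num_nodes=12, racks=3, vnodes=8, rf=3, seed=0):
    """Return (rack_of, groups) for a random ring with one replica per rack."""
    if racks < rf or num_nodes < racks:
        raise ValueError("this sketch needs racks >= rf and at least one node per rack")
    rng = random.Random(seed)
    # Assign nodes to racks round-robin so that every rack has at least one node.
    rack_of = {node: node % racks for node in range(num_nodes)}
    ring = sorted((rng.random(), node)
                  for node in range(num_nodes) for _ in range(vnodes))
    owners = [node for _, node in ring]
    groups = []
    for start in range(len(owners)):
        replicas, used_racks = [], set()
        pos = start
        while len(replicas) < rf:
            node = owners[pos % len(owners)]
            # Skip nodes whose rack already holds a replica of this range.
            if rack_of[node] not in used_racks:
                replicas.append(node)
                used_racks.add(rack_of[node])
            pos += 1
        groups.append(replicas)
    return rack_of, groups


def survives_rack_loss(rack_of, groups, dead_rack, rf=3):
    """True if every range still has a quorum of replicas outside dead_rack."""
    quorum = rf // 2 + 1
    return all(sum(rack_of[node] != dead_rack for node in group) >= quorum
               for group in groups)


if __name__ == "__main__":
    rack_of, groups = rack_aware_groups()
    print(all(survives_rack_loss(rack_of, groups, rack) for rack in range(3)))
```

Because each range has exactly one replica per rack in this model, taking down a whole rack removes at most one replica of any range, regardless of how many vnodes each node has.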

Feel free to correct my numbers if I'm wrong.


On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev wrote:
Thanks, Rahul.
But in your example, at the same time loss of Node3 and Node6 leads to loss of ranges N, C,
J at consistency level QUORUM.

As far as I understand, in case vnodes > N_nodes_in_cluster and endpoint_snitch=SimpleSnitch, given that

1) "secondary" replicas are placed on the two nodes 'next' to the node responsible for a range
(in case of RF=3)
2) there are a lot of vnodes on each node
3) ranges are evenly distributed between vnodes in case of SimpleSnitch,

we get all physical nodes (servers) having mutually adjacent token ranges.
Is that correct?

At least in the case of my real-world ~50-node cluster with vnodes=256, RF=3, this command:

nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2 -A2 '<ip_of_a_node>'
| grep -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq | wc -l

returned a number equal to Nnodes - 1, which means that I can't switch off 2 nodes at the
same time without losing some key range at CL=QUORUM.
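
For readers who prefer not to chain grep and awk, the pipeline above can be mirrored in a few lines of Python. This is a sketch only: it assumes the addresses have already been extracted from "nodetool ring" output into a list in token order, and the function name and sample addresses are made up. Unlike grep -B2 -A2, it wraps around the ends of the ring.

```python
from itertools import groupby


def ring_neighbours(owners, target, distance=2):
    """Return the distinct addresses within `distance` ring positions of `target`.

    `owners` is the list of node addresses in token order, one entry per token.
    """
    # Collapse runs of the same address, like `uniq` in the shell pipeline.
    collapsed = [address for address, _ in groupby(owners)]
    size = len(collapsed)
    found = set()
    for index, address in enumerate(collapsed):
        if address != target:
            continue
        # Look `distance` positions before and after, wrapping around the ring.
        for offset in range(-distance, distance + 1):
            neighbour = collapsed[(index + offset) % size]
            if neighbour != target:
                found.add(neighbour)
    return found


if __name__ == "__main__":
    # Hypothetical token-ordered owners for a tiny ring.
    owners = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.1",
              "10.0.0.4", "10.0.0.5", "10.0.0.6"]
    print(len(ring_neighbours(owners, "10.0.0.1")))
```

If the returned set contains every other node in the cluster, then no second node can be taken down together with the target without some range dropping below QUORUM at RF=3.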

From: Rahul Neelakantan
Sent: Monday, January 15, 2018 5:20:20 PM
Subject: Re: vnodes: high availability

Not necessarily. It depends on how the token ranges for the vNodes are assigned to them. For
example take a look at this diagram

In the vNode part of the diagram, you will see that Loss of Node 3 and Node 6, will still
not have any effect on Token Range A. But yes if you lose two nodes that both have Token Range
A assigned to them (Say Node 1 and Node 2), you will have unavailability with your specified

You can sort of circumvent this by using the DataStax Java Driver and having the client recognize
a degraded cluster and operate temporarily in downgraded consistency mode

- Rahul

On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev wrote:

Let's say we have a C* cluster with following parameters:
 - 50 nodes in the cluster
 - RF=3
 - vnodes=256 per node
 - CL for some queries = QUORUM
 - endpoint_snitch = SimpleSnitch

Is it correct that any 2 nodes being down will cause unavailability of a key range at CL=QUORUM?


Alexander Dejanovski

Apache Cassandra Consulting