I've been wondering about virtual nodes and how cluster uptime might change as cluster size increases.
I understand clusters will benefit from increased reliability due to faster rebuild time, but does that hold true for large clusters?
Correct me if I'm wrong, but with virtual nodes every physical node will likely share some small amount of data with every other node. If so, then as the number of physical nodes in a Cassandra cluster increases (say, into the triple digits), the probability of at least one Quorum read/write failure occurring in a given time period would *increase*.
Would this hold true, at least until the number of physical nodes becomes greater than num_tokens per node?
I understand that the window of failure for affected ranges would probably be small, but we do Quorum reads of many keys, so we'd likely hit every virtual range with our queries, even if num_tokens were 256.
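
To make the concern concrete, here is a rough back-of-the-envelope sketch in Python. It is only a model under assumptions of mine, not a description of how Cassandra actually behaves: node outages are independent, each node is down with a made-up probability `p_down`, the replication factor is 3 (so two replicas down breaks Quorum for a range), and tokens are placed at random so that each token gives a node roughly `2 * (rf - 1)` replica neighbours. The function names and the value of `p_down` are hypothetical.

```python
from math import comb


def p_at_least_k_down(n_nodes: int, p_down: float, k: int = 2) -> float:
    """Probability that k or more of n_nodes are down at once.

    Assumes independent outages (a simple binomial model).
    """
    p_fewer = sum(
        comb(n_nodes, i) * p_down**i * (1 - p_down) ** (n_nodes - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


def p_pair_shares_range(n_nodes: int, num_tokens: int, rf: int = 3) -> float:
    """Rough upper bound on the chance that two given nodes replicate a common range.

    Assumption: each token puts a node next to about 2 * (rf - 1) replica
    neighbours, and tokens are spread randomly over the other nodes.
    """
    if n_nodes < 2:
        return 0.0
    neighbours = 2 * (rf - 1) * num_tokens
    return min(1.0, neighbours / (n_nodes - 1))


if __name__ == "__main__":
    p_down = 0.001  # hypothetical per-node unavailability, not a measured value
    for n in (6, 30, 100, 300):
        two_down = p_at_least_k_down(n, p_down)
        for tokens in (1, 256):
            share = p_pair_shares_range(n, tokens)
            # Crude estimate: two nodes are down AND they share some range.
            print(f"nodes={n:4d} num_tokens={tokens:3d} "
                  f"P(>=2 down)={two_down:.6f} P(pair shares range)<={share:.3f} "
                  f"estimate={two_down * share:.6f}")
```

Under these assumptions, with num_tokens at 256 the "pair shares a range" term stays pinned at 1 until the cluster is far larger than a few hundred nodes, so the estimate simply tracks the chance of any two nodes being down together, which grows with cluster size. With a single token per node that term shrinks as nodes are added, which is the difference I am asking about.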