You're right, the math does assume independence, which is unlikely to be accurate. But if you do have correlated failure modes (e.g. shared power, racks, or data centers), you can still use Cassandra's rack-aware or DC-aware replica placement to ensure replicas are spread out, so the cluster can survive the correlated failure mode. So I would expect vnodes to improve uptime in all scenarios, but I haven't done the math to prove it.
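To make the rack-aware point concrete, here is a toy sketch of rack-aware replica placement, loosely in the spirit of Cassandra's NetworkTopologyStrategy but not its actual implementation; the rack/node names are made up:

```python
# Toy rack-aware replica placement (illustrative only, not Cassandra code).
def place_replicas(racks, rf):
    """Pick `rf` replica nodes, walking racks round-robin so replicas
    land on as many distinct racks as possible.

    racks: dict mapping rack name -> list of node names
    """
    # Interleave nodes rack by rack: r1[0], r2[0], ..., r1[1], r2[1], ...
    ordered = []
    iters = {name: iter(nodes) for name, nodes in racks.items()}
    while iters:
        for name in list(iters):
            try:
                ordered.append(next(iters[name]))
            except StopIteration:
                del iters[name]
    return ordered[:rf]

racks = {"r1": ["a", "b"], "r2": ["c", "d"], "r3": ["e", "f"]}
print(place_replicas(racks, rf=3))  # one replica per rack: ['a', 'c', 'e']
```

With RF=3 spread over three racks like this, losing an entire rack (the correlated failure) still leaves two of three replicas up, so quorum reads and writes keep working.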

Richard.

On 9 December 2012 17:50, Tyler Hobbs <tyler@datastax.com> wrote:

Nicolas,

Strictly speaking, your math makes the assumption that the failures of different nodes are probabilistically independent events. This is, of course, not an accurate assumption for real-world conditions. Nodes share racks, networking equipment, power, availability zones, data centers, etc. So I think the mathematical assertion is not quite as strong as one would like, but it's certainly a good argument for handling certain types of node failures.

On Fri, Dec 7, 2012 at 11:27 AM, Nicolas Favre-Felix <nicolas@acunu.com> wrote:

Hi Eric,

Your concerns are perfectly valid. We (Acunu) led the design and implementation of this feature and spent a long time looking at the impact of such a large change. We summarized some of our notes and wrote about the impact of virtual nodes on cluster uptime a few months back: http://www.acunu.com/2/post/2012/10/improving-cassandras-uptime-with-virtual-nodes.html

The main argument in this blog post is that you only have a failure to perform quorum reads/writes if at least RF replicas fail within the time it takes to rebuild the first dead node. We show that virtual nodes actually decrease the probability of failure, by streaming data from all nodes and thereby improving the rebuild time.

Regards,
Nicolas

On Wed, Dec 5, 2012 at 4:45 PM, Eric Parusel <ericparusel@gmail.com> wrote:
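The rebuild-window argument can be sketched numerically. This is a toy model under the (admittedly shaky) independence assumption, not the blog post's actual derivation; the failure rate, rebuild time, and cluster size are all made up for illustration:

```python
# Toy model: probability that a second failure lands inside the rebuild
# window after a first node dies. Independence assumed; numbers invented.
import math

def p_quorum_loss(rate_per_hour, rebuild_hours, peers):
    """P(at least one of `peers` other nodes fails during the rebuild
    window), with exponential failures at `rate_per_hour` per node."""
    p_one = 1 - math.exp(-rate_per_hour * rebuild_hours)
    return 1 - (1 - p_one) ** peers

rate = 1 / (24 * 365)   # assumed: ~1 failure per node-year
t_single = 10.0         # assumed: 10 h to restream a node from one source
n = 100                 # assumed cluster size

# Single token per node: only the RF-1 = 2 other replicas matter, but the
# replacement streams from one source and takes the full t_single.
p_single = p_quorum_loss(rate, t_single, peers=2)

# vnodes: any of the n-1 other nodes shares some range with the dead node,
# but the rebuild streams from all of them in parallel.
p_vnodes = p_quorum_loss(rate, t_single / (n - 1), peers=n - 1)

print(p_single, p_vnodes)  # the vnodes figure comes out lower
```

In this simplified model the shorter window more than compensates for the larger number of relevant neighbours, which is the shape of the blog post's claim.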

Hi all,

I've been wondering about virtual nodes and how cluster uptime might change as cluster size increases. I understand clusters will benefit from increased reliability due to faster rebuild time, but does that hold true for large clusters?

It seems that since (and correct me if I'm wrong here) every physical node will likely share some small amount of data with every other node, as the count of physical nodes in a Cassandra cluster increases (let's say into the triple digits), the probability of at least one failure to quorum read/write occurring in a given time period would *increase*. Would this hold true, at least until the number of physical nodes becomes greater than num_tokens per node?

I understand that the window of failure for affected ranges would probably be small, but we do quorum reads of many keys, so we'd likely hit every virtual range with our queries, even if num_tokens was 256.

Thanks,
Eric
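Eric's scaling question can also be sketched with a toy independence model (the same caveat about correlated failures applies, and every number here is assumed for illustration): count the expected number of "a second node dies inside the rebuild window" events per year as the cluster grows.

```python
# Toy model: expected quorum-loss windows per year vs. cluster size.
# Independence assumed; rates and rebuild times are invented.
import math

def incidents_per_year(n, rate_per_hour, t_single, vnodes):
    first_failures = n * rate_per_hour * 24 * 365  # expected node deaths/year
    if vnodes:
        # Parallel rebuild shrinks the window, but any neighbour matters.
        window, peers = t_single / (n - 1), n - 1
    else:
        # Full-length rebuild, but only the RF-1 = 2 other replicas matter.
        window, peers = t_single, 2
    p_one = 1 - math.exp(-rate_per_hour * window)
    p_second = 1 - (1 - p_one) ** peers
    return first_failures * p_second

rate, t_single = 1 / (24 * 365), 10.0
for n in (10, 100, 1000):
    print(n, incidents_per_year(n, rate, t_single, vnodes=True),
             incidents_per_year(n, rate, t_single, vnodes=False))
```

In this model the expected incident count does grow roughly linearly with cluster size, as Eric suspects; but it grows at the same rate without vnodes, and the vnodes figure stays below the single-token one at every size, so vnodes don't make the scaling worse.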

Tyler Hobbs

DataStax

Richard Low

Acunu | http://www.acunu.com | @acunu