incubator-cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sam Overton <>
Subject Re: RFC: Cassandra Virtual Nodes
Date Tue, 10 Apr 2012 13:38:48 GMT
There is now a parent ticket for this issue in JIRA:

Comments and contributions are still welcome!



On 16 March 2012 23:38, Sam Overton <> wrote:
> Hello cassandra-dev,
> This is a long email. It concerns a significant change to Cassandra, so
> deserves a thorough introduction.
> The summary is: we believe virtual nodes are the way forward. We would like
> to add virtual nodes to Cassandra and we are asking for comments, criticism
> and collaboration!
> Cassandra's current partitioning scheme is sub-optimal for bootstrap,
> decommission, repair and re-balance operations, and places the burden on
> users to properly calculate tokens (a common cause of mistakes), which is a
> recurring pain-point.
> Virtual nodes have a variety of benefits over the one-to-one mapping of host
> to key range which Cassandra currently supports.
> Among these benefits are:
> * Even load balancing when growing and shrinking the cluster
> A virtual node scheme ensures that all hosts in a cluster have an even
> portion of the total data, and a new node bootstrapped into the cluster will
> assume its share of the data. Doubling, or halving the cluster to ensure
> even load distribution would no longer be necessary.
> * Distributed rebuild
> When sizing a cluster, one of the considerations is the amount of time
> required to recover from a failed node. This is the exposure time, during
> which a secondary failure could cause data loss. In order to guarantee an
> upper bound on the exposure time, the amount of data which can be stored on
> each host is limited by the amount of time taken to recover the required
> replica count. At Acunu we have found that the exposure time is frequently
> the limiting factor which dictates the maximum allowed node size in
> customers' clusters.
> Using a virtual node scheme, the data stored on one host is not replicated
> on just RF-1 other physical hosts. Each virtual node is replicated to RF-1
> other virtual nodes which may be on a different set of physical hosts to
> replicas of other virtual nodes stored on the same host. This means data for
> one host is replicated evenly across the entire cluster.
> In the event of a failure then, restoring the replica count can be done in a
> fully distributed way. Each host in the cluster participates in the rebuild,
> drastically reducing the exposure time, allowing more data to be stored on a
> single host while still maintaining an acceptable upper bound on the
> likelihood of secondary failure. This reduces TCO concerns.
> * Greater failure tolerance in streaming
> Operations which require streaming of a large range of data, eg. bootstrap,
> decommission, repair, etc. incur a heavy cost if an error (eg. dropped
> network connection) is encountered during the streaming. Currently the whole
> range must be re-streamed, and this could constitute a very large amount of
> data. Virtual nodes reduce the impact of streaming failures, since each
> virtual node is a much smaller range of the key-space, so re-streaming a
> whole virtual node is a much cheaper process.
> * Evenly distributed impact of streaming operations
> Streaming operations such as bootstrap, repair, et al. would involve every
> node in the cluster. This would distribute the load of these operations
> across the whole cluster, and could be staggered so that only a small subset
> of nodes were affected at once, similar to staggered repair[1].
> * Possibility for active load balancing
> Load balancing in Cassandra currently involves moving a token to
> increase/reduce the amount of key-space for which a host is responsible.
> This only allows load balancing between neighbouring nodes, so it could
> involve moving more than one token just to redistribute a single overloaded
> node. Virtual nodes could allow load balancing on a much finer granularity,
> so heavily loaded portions of the key-space could be redistributed to
> lighter-loaded hosts by reassigning one or more virtual nodes.
> Implementing a virtual node scheme in Cassandra is not an insignificant
> amount of work, and it will touch a large amount of the codebase related to
> partitioning, placement, routing, gossip, and so on. We do believe that this
> is possible to do incrementally, and in such a way that there is an easy
> upgrade path for pre-virtual-node deployments.
> It would not however touch the storage layer. The virtual node concept is
> solely for partitioning and placement, not for segregating the data storage
> of the host, so all keys for all virtual nodes on a host would be stored in
> the same SSTables.
> We are not proposing the adoption of the same scheme used by Voldemort[2]
> and described in the Dynamo paper[3]. We feel this scheme is too different
> from Cassandra's current distribution model to be a viable target for
> incremental development. Their scheme also fixes the number of virtual nodes
> for the lifetime of the cluster, which can prove to be a ceiling to scaling
> the cluster if the virtual nodes grow too large.
> The proposed design is:
> * Assign each host T random tokens.
> * A partition is assigned to a host for each of its tokens, where the
> partition is defined by the interval between a token and the previous token
> on the ring.
> * When a host joins the ring it is assigned T random tokens which will
> result in a portion of an existing partition being assigned to that host.
> * When a host leaves the ring it relinquishes its tokens which will result
> in its partitions becoming part of the neighbouring partitions.
> This is just a basic extension of Cassandra's existing distribution model,
> where instead of having 1 token per host, there are many tokens per host. It
> is the same scheme used by libketama[4] for consistent hashing among
> memcached instances, and is also the original scheme used by Dynamo as
> described in [3] before they migrated to their current scheme with fixed
> partitions.
> The random assignment of tokens may seem unintuitive given that currently in
> Cassandra a random token assigment leads to an unbalanced cluster. With many
> virtual nodes, a random token assignment leads to load being evenly balanced
> across the hosts in the cluster with high probability. As the number of
> virtual nodes is increased, the variance in load across hosts decreases, as
> demonstrated by simulation in [5].
> This scheme has the following properties - (where N is the number of hosts
> and B is the total data stored in the cluster):
> * placement metadata size is O(N) which is the same as in Cassandra
> currently
> * partition size is O(B/N) so as data is inserted, if individual partitions
> become too large then adding nodes to the cluster reduces this.
> * the strategy shares the following properties in common with Cassandra
> currently
> ** tokens are randomly assigned
> ** partitioning is determined by placement (and vice-versa)
> ** no two nodes may share the same token
> ** when a node leaves the ring, all of its tokens are removed - there is no
> exchanging of partitions between nodes
> One design concern is that replicas of a key range are not stored on the
> same physical host, as failure of that host could cause the loss of more
> than one replica of the data. This will be achieved by using a placement
> strategy very similar the the existing NetworkTopologyStrategy, which treats
> each individual host the same way as NTS treats a rack - that is replicas
> are not assigned to two hosts on the same rack.
> I will shortly create a ticket in JIRA to track discussion of this design.
> We have also done some simulation of this scheme to observe the load
> balancing properties, node size distribution, cluster resizing and so on. I
> will attach some results of this simulation to the JIRA ticket in due
> course.
> We are keen to get the ball rolling on this and we look forward to your
> input, ideas and recommendations.
> Best Regards,
> Sam Overton
> [1] Staggering repair:
> [2] Project Voldemort, Design:
> [3] Dynamo:
> [4] Ketama: Consistent Hashing:
> [5] Consistent Hashing:
> --
> Sam Overton
> Acunu | | @acunu

Sam Overton
Acunu | | @acunu

View raw message