incubator-cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zhu Han <>
Subject Re: RFC: Cassandra Virtual Nodes
Date Sat, 17 Mar 2012 15:22:40 GMT
On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton <> wrote:

> Hello cassandra-dev,
> This is a long email. It concerns a significant change to Cassandra, so
> deserves a thorough introduction.
> *The summary is*: we believe virtual nodes are the way forward. We would
> like to add virtual nodes to Cassandra and we are asking for comments,
> criticism and collaboration!
> Cassandra's current partitioning scheme is sub-optimal for bootstrap,
> decommission, repair and re-balance operations, and places the burden on
> users to properly calculate tokens (a common cause of mistakes), which is a
> recurring pain-point.
> Virtual nodes have a variety of benefits over the one-to-one mapping of
> host to key range which Cassandra currently supports.
> Among these benefits are:
> * Even load balancing when growing and shrinking the cluster
> A virtual node scheme ensures that all hosts in a cluster have an even
> portion of the total data, and a new node bootstrapped into the cluster
> will assume its share of the data. Doubling, or halving the cluster to
> ensure even load distribution would no longer be necessary.
> * Distributed rebuild
> When sizing a cluster, one of the considerations is the amount of time
> required to recover from a failed node. This is the exposure time, during
> which a secondary failure could cause data loss. In order to guarantee an
> upper bound on the exposure time, the amount of data which can be stored on
> each host is limited by the amount of time taken to recover the required
> replica count. At Acunu we have found that the exposure time is frequently
> the limiting factor which dictates the maximum allowed node size in
> customers' clusters.
> Using a virtual node scheme, the data stored on one host is not replicated
> on just RF-1 other physical hosts. Each virtual node is replicated to RF-1
> other virtual nodes which may be on a different set of physical hosts to
> replicas of other virtual nodes stored on the same host. This means data
> for one host is replicated evenly across the entire cluster.
> In the event of a failure then, restoring the replica count can be done in
> a fully distributed way. Each host in the cluster participates in the
> rebuild, drastically reducing the exposure time, allowing more data to be
> stored on a single host while still maintaining an acceptable upper bound
> on the likelihood of secondary failure. This reduces TCO concerns.
> * Greater failure tolerance in streaming
> Operations which require streaming of a large range of data, eg. bootstrap,
> decommission, repair, etc. incur a heavy cost if an error (eg. dropped
> network connection) is encountered during the streaming. Currently the
> whole range must be re-streamed, and this could constitute a very large
> amount of data. Virtual nodes reduce the impact of streaming failures,
> since each virtual node is a much smaller range of the key-space, so
> re-streaming a whole virtual node is a much cheaper process.
> * Evenly distributed impact of streaming operations
> Streaming operations such as bootstrap, repair, et al. would involve every
> node in the cluster. This would distribute the load of these operations
> across the whole cluster, and could be staggered so that only a small
> subset of nodes were affected at once, similar to staggered repair[1].
> * Possibility for active load balancing
> Load balancing in Cassandra currently involves moving a token to
> increase/reduce the amount of key-space for which a host is responsible.
> This only allows load balancing between neighbouring nodes, so it could
> involve moving more than one token just to redistribute a single overloaded
> node. Virtual nodes could allow load balancing on a much finer granularity,
> so heavily loaded portions of the key-space could be redistributed to
> lighter-loaded hosts by reassigning one or more virtual nodes.
> Implementing a virtual node scheme in Cassandra is not an insignificant
> amount of work, and it will touch a large amount of the codebase related to
> partitioning, placement, routing, gossip, and so on. We do believe that
> this is possible to do incrementally, and in such a way that there is an
> easy upgrade path for pre-virtual-node deployments.
> It would not however touch the storage layer. The virtual node concept is
> solely for partitioning and placement, not for segregating the data storage
> of the host, so all keys for all virtual nodes on a host would be stored in
> the same SSTables.
> We are not proposing the adoption of the same scheme used by Voldemort[2]
> and described in the Dynamo paper[3]. We feel this scheme is too different
> from Cassandra's current distribution model to be a viable target for
> incremental development. Their scheme also fixes the number of virtual
> nodes for the lifetime of the cluster, which can prove to be a ceiling to
> scaling the cluster if the virtual nodes grow too large.
> The proposed design is:
> * Assign each host T random tokens.

Will it work well for OrderedPartitioner?

For load balance purpose, is it possible to detect the access pattern of
range and move
the range dynamically instead of introducing virtual node?

> * A partition is assigned to a host for each of its tokens, where the
> partition is defined by the interval between a token and the previous token
> on the ring.
> * When a host joins the ring it is assigned T random tokens which will
> result in a portion of an existing partition being assigned to that host.
> * When a host leaves the ring it relinquishes its tokens which will result
> in its partitions becoming part of the neighbouring partitions.
> This is just a basic extension of Cassandra's existing distribution model,
> where instead of having 1 token per host, there are many tokens per host.
> It is the same scheme used by libketama[4] for consistent hashing among
> memcached instances, and is also the original scheme used by Dynamo as
> described in [3] before they migrated to their current scheme with fixed
> partitions.
> The random assignment of tokens may seem unintuitive given that currently
> in Cassandra a random token assigment leads to an unbalanced cluster. With
> many virtual nodes, a random token assignment leads to load being evenly
> balanced across the hosts in the cluster with high probability. As the
> number of virtual nodes is increased, the variance in load across hosts
> decreases, as demonstrated by simulation in [5].
> This scheme has the following properties - (where N is the number of hosts
> and B is the total data stored in the cluster):
> * placement metadata size is O(N) which is the same as in Cassandra
> currently
> * partition size is O(B/N) so as data is inserted, if individual partitions
> become too large then adding nodes to the cluster reduces this.
> * the strategy shares the following properties in common with Cassandra
> currently
> ** tokens are randomly assigned
> ** partitioning is determined by placement (and vice-versa)
> ** no two nodes may share the same token
> ** when a node leaves the ring, all of its tokens are removed - there is no
> exchanging of partitions between nodes
> One design concern is that replicas of a key range are not stored on the
> same physical host, as failure of that host could cause the loss of more
> than one replica of the data. This will be achieved by using a placement
> strategy very similar the the existing NetworkTopologyStrategy, which
> treats each individual host the same way as NTS treats a rack - that is
> replicas are not assigned to two hosts on the same rack.
> I will shortly create a ticket in JIRA to track discussion of this design.
> We have also done some simulation of this scheme to observe the load
> balancing properties, node size distribution, cluster resizing and so on. I
> will attach some results of this simulation to the JIRA ticket in due
> course.
> We are keen to get the ball rolling on this and we look forward to your
> input, ideas and recommendations.
> Best Regards,
> Sam Overton
> [1] Staggering repair:
> [2] Project Voldemort, Design:
> [3] Dynamo:
> [4] Ketama: Consistent Hashing:
> [5] Consistent Hashing:
> --
> Sam Overton
> Acunu | | @acunu

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message