incubator-cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eric Evans <eev...@acunu.com>
Subject Re: RFC: Cassandra Virtual Nodes
Date Sat, 17 Mar 2012 17:54:07 GMT
On Sat, Mar 17, 2012 at 3:22 PM, Zhu Han <schumi.han@gmail.com> wrote:
> On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton <sam@acunu.com> wrote:
>> This is a long email. It concerns a significant change to Cassandra, so
>> deserves a thorough introduction.
>>
>> *The summary is*: we believe virtual nodes are the way forward. We would
>> like to add virtual nodes to Cassandra and we are asking for comments,
>> criticism and collaboration!
>>
>> Cassandra's current partitioning scheme is sub-optimal for bootstrap,
>> decommission, repair and re-balance operations, and places the burden on
>> users to properly calculate tokens (a common cause of mistakes), which is a
>> recurring pain-point.
>>
>> Virtual nodes have a variety of benefits over the one-to-one mapping of
>> host to key range which Cassandra currently supports.
>>
>> Among these benefits are:
>>
>> * Even load balancing when growing and shrinking the cluster
>> A virtual node scheme ensures that all hosts in a cluster have an even
>> portion of the total data, and a new node bootstrapped into the cluster
>> will assume its share of the data. Doubling, or halving the cluster to
>> ensure even load distribution would no longer be necessary.
>>
>> * Distributed rebuild
>> When sizing a cluster, one of the considerations is the amount of time
>> required to recover from a failed node. This is the exposure time, during
>> which a secondary failure could cause data loss. In order to guarantee an
>> upper bound on the exposure time, the amount of data which can be stored on
>> each host is limited by the amount of time taken to recover the required
>> replica count. At Acunu we have found that the exposure time is frequently
>> the limiting factor which dictates the maximum allowed node size in
>> customers' clusters.
>>
>> Using a virtual node scheme, the data stored on one host is not replicated
>> on just RF-1 other physical hosts. Each virtual node is replicated to RF-1
>> other virtual nodes which may be on a different set of physical hosts to
>> replicas of other virtual nodes stored on the same host. This means data
>> for one host is replicated evenly across the entire cluster.
>>
>> In the event of a failure then, restoring the replica count can be done in
>> a fully distributed way. Each host in the cluster participates in the
>> rebuild, drastically reducing the exposure time, allowing more data to be
>> stored on a single host while still maintaining an acceptable upper bound
>> on the likelihood of secondary failure. This reduces TCO concerns.
>>
>> * Greater failure tolerance in streaming
>> Operations which require streaming of a large range of data, eg. bootstrap,
>> decommission, repair, etc. incur a heavy cost if an error (eg. dropped
>> network connection) is encountered during the streaming. Currently the
>> whole range must be re-streamed, and this could constitute a very large
>> amount of data. Virtual nodes reduce the impact of streaming failures,
>> since each virtual node is a much smaller range of the key-space, so
>> re-streaming a whole virtual node is a much cheaper process.
>>
>> * Evenly distributed impact of streaming operations
>> Streaming operations such as bootstrap, repair, et al. would involve every
>> node in the cluster. This would distribute the load of these operations
>> across the whole cluster, and could be staggered so that only a small
>> subset of nodes were affected at once, similar to staggered repair[1].
>>
>> * Possibility for active load balancing
>> Load balancing in Cassandra currently involves moving a token to
>> increase/reduce the amount of key-space for which a host is responsible.
>> This only allows load balancing between neighbouring nodes, so it could
>> involve moving more than one token just to redistribute a single overloaded
>> node. Virtual nodes could allow load balancing on a much finer granularity,
>> so heavily loaded portions of the key-space could be redistributed to
>> lighter-loaded hosts by reassigning one or more virtual nodes.
>>
>>
>> Implementing a virtual node scheme in Cassandra is not an insignificant
>> amount of work, and it will touch a large amount of the codebase related to
>> partitioning, placement, routing, gossip, and so on. We do believe that
>> this is possible to do incrementally, and in such a way that there is an
>> easy upgrade path for pre-virtual-node deployments.
>>
>> It would not however touch the storage layer. The virtual node concept is
>> solely for partitioning and placement, not for segregating the data storage
>> of the host, so all keys for all virtual nodes on a host would be stored in
>> the same SSTables.
>>
>> We are not proposing the adoption of the same scheme used by Voldemort[2]
>> and described in the Dynamo paper[3]. We feel this scheme is too different
>> from Cassandra's current distribution model to be a viable target for
>> incremental development. Their scheme also fixes the number of virtual
>> nodes for the lifetime of the cluster, which can prove to be a ceiling to
>> scaling the cluster if the virtual nodes grow too large.
>>
>> The proposed design is:
>> * Assign each host T random tokens.
>>
>
> Will it work well for OrderedPartitioner?

It doesn't preclude the use of an ordered partitioner, but depending
on the size of the range, and the number of tokens (T * nodes), it
could mean contacting more nodes.  You could look at that as a
negative or positive I suppose.

Either way, I'm sure that most people would agree at this point that
it doesn't make sense to optimize for ordered partitioning;
Conventional wisdom is to use the RandomPartitioner

> For load balance purpose, is it possible to detect the access pattern of
> range and move
> the range dynamically instead of introducing virtual node?

Anything is possible, but automating doesn't solve any of the real
problems.  Ultimately you want to avoid having to rebalance in the
first place, and if it can't be avoided, to parallelize it over many
nodes.

-- 
Eric Evans
Acunu | http://www.acunu.com | @acunu

Mime
View raw message