From: Sam Overton <soverton@acunu.com>
Reply-To: sam@acunu.com
To: dev@cassandra.apache.org
Date: Fri, 16 Mar 2012 23:38:17 +0000
Subject: RFC: Cassandra Virtual Nodes

Hello cassandra-dev,

This is a long email. It concerns a significant change to Cassandra, so deserves a thorough introduction.

*The summary is*: we believe virtual nodes are the way forward. We would like to add virtual nodes to Cassandra and we are asking for comments, criticism and collaboration!
Cassandra's current partitioning scheme is sub-optimal for bootstrap, decommission, repair and re-balance operations, and places the burden on users to properly calculate tokens (a common cause of mistakes), which is a recurring pain-point.

Virtual nodes have a variety of benefits over the one-to-one mapping of host to key range which Cassandra currently supports. Among these benefits are:

* Even load balancing when growing and shrinking the cluster

A virtual node scheme ensures that all hosts in a cluster have an even portion of the total data, and a new node bootstrapped into the cluster will assume its share of the data. Doubling or halving the cluster to ensure even load distribution would no longer be necessary.

* Distributed rebuild

When sizing a cluster, one of the considerations is the amount of time required to recover from a failed node. This is the exposure time, during which a secondary failure could cause data loss. In order to guarantee an upper bound on the exposure time, the amount of data which can be stored on each host is limited by the amount of time taken to recover the required replica count. At Acunu we have found that the exposure time is frequently the limiting factor which dictates the maximum allowed node size in customers' clusters.

Using a virtual node scheme, the data stored on one host is not replicated on just RF-1 other physical hosts. Each virtual node is replicated to RF-1 other virtual nodes, which may be on a different set of physical hosts to the replicas of other virtual nodes stored on the same host. This means data for one host is replicated evenly across the entire cluster.

In the event of a failure, then, restoring the replica count can be done in a fully distributed way. Each host in the cluster participates in the rebuild, drastically reducing the exposure time and allowing more data to be stored on a single host while still maintaining an acceptable upper bound on the likelihood of secondary failure. This reduces TCO concerns. (A small simulation of this effect is sketched below.)

* Greater failure tolerance in streaming

Operations which require streaming of a large range of data, e.g. bootstrap, decommission, repair, etc., incur a heavy cost if an error (e.g. a dropped network connection) is encountered during the streaming. Currently the whole range must be re-streamed, and this could constitute a very large amount of data. Virtual nodes reduce the impact of streaming failures, since each virtual node covers a much smaller range of the key-space, so re-streaming a whole virtual node is a much cheaper process.

* Evenly distributed impact of streaming operations

Streaming operations such as bootstrap, repair, et al. would involve every node in the cluster. This would distribute the load of these operations across the whole cluster, and could be staggered so that only a small subset of nodes is affected at once, similar to staggered repair[1].

* Possibility for active load balancing

Load balancing in Cassandra currently involves moving a token to increase or reduce the amount of key-space for which a host is responsible. This only allows load balancing between neighbouring nodes, so it could involve moving more than one token just to redistribute a single overloaded node. Virtual nodes could allow load balancing at a much finer granularity, so heavily loaded portions of the key-space could be redistributed to more lightly loaded hosts by reassigning one or more virtual nodes.
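To make the distributed-rebuild argument concrete, here is a minimal, hypothetical simulation (plain Python, not Cassandra code; the host count, RF=3, 2^64 token space and SimpleStrategy-style clockwise replica walk are assumptions chosen purely for illustration). It counts how many surviving hosts hold a copy of some range that was replicated on a failed host, for different numbers of tokens per host:

    # Hypothetical simulation of the "distributed rebuild" benefit.
    # Assumptions (not from the proposal itself): 100 hosts, RF=3, a 2**64
    # token space, and a SimpleStrategy-style replica walk that picks the
    # next RF distinct hosts clockwise from each partition.
    import random

    RING = 2**64
    RF = 3

    def build_ring(num_hosts, tokens_per_host, seed=42):
        """Return a list of (token, host) pairs sorted by token."""
        rng = random.Random(seed)
        ring = [(rng.randrange(RING), host)
                for host in range(num_hosts)
                for _ in range(tokens_per_host)]
        ring.sort()
        return ring

    def replicas(ring, i):
        """Hosts replicating the partition whose primary token is ring[i][0]:
        walk clockwise until RF distinct hosts have been collected."""
        hosts, j = [], i
        while len(hosts) < RF:
            host = ring[j % len(ring)][1]
            if host not in hosts:
                hosts.append(host)
            j += 1
        return hosts

    def rebuild_sources(ring, failed_host):
        """Distinct surviving hosts holding a replica of any partition that
        was also replicated on the failed host."""
        sources = set()
        for i in range(len(ring)):
            owners = replicas(ring, i)
            if failed_host in owners:
                sources.update(h for h in owners if h != failed_host)
        return sources

    for T in (1, 4, 64):
        ring = build_ring(num_hosts=100, tokens_per_host=T)
        print(f"T={T}: {len(rebuild_sources(ring, failed_host=0))} hosts can stream for the rebuild")
    # With T=1 only a handful of neighbours (roughly 2*(RF-1)) can
    # participate; with T=64 almost all 99 surviving hosts can.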
Implementing a virtual node scheme in Cassandra is not an insignificant amount of work, and it will touch a large amount of the codebase related to partitioning, placement, routing, gossip, and so on. We do believe that this is possible to do incrementally, and in such a way that there is an easy upgrade path for pre-virtual-node deployments.

It would not, however, touch the storage layer. The virtual node concept is solely for partitioning and placement, not for segregating the data storage of the host, so all keys for all virtual nodes on a host would be stored in the same SSTables.

We are not proposing the adoption of the same scheme used by Voldemort[2] and described in the Dynamo paper[3]. We feel this scheme is too different from Cassandra's current distribution model to be a viable target for incremental development. Their scheme also fixes the number of virtual nodes for the lifetime of the cluster, which can prove to be a ceiling to scaling the cluster if the virtual nodes grow too large.

The proposed design is:

* Assign each host T random tokens.
* A partition is assigned to a host for each of its tokens, where the partition is defined by the interval between a token and the previous token on the ring.
* When a host joins the ring it is assigned T random tokens, each of which will result in a portion of an existing partition being assigned to that host.
* When a host leaves the ring it relinquishes its tokens, which will result in its partitions becoming part of the neighbouring partitions.

This is just a basic extension of Cassandra's existing distribution model, where instead of having 1 token per host, there are many tokens per host. It is the same scheme used by libketama[4] for consistent hashing among memcached instances, and is also the original scheme used by Dynamo as described in [3], before they migrated to their current scheme with fixed partitions.

The random assignment of tokens may seem unintuitive, given that currently in Cassandra a random token assignment leads to an unbalanced cluster. With many virtual nodes, a random token assignment leads to load being evenly balanced across the hosts in the cluster with high probability. As the number of virtual nodes is increased, the variance in load across hosts decreases, as demonstrated by simulation in [5].

This scheme has the following properties (where N is the number of hosts and B is the total data stored in the cluster):

* placement metadata size is O(N), which is the same as in Cassandra currently
* partition size is O(B/N), so as data is inserted, if individual partitions become too large then adding nodes to the cluster reduces this
* the strategy shares the following properties in common with Cassandra currently:
** tokens are randomly assigned
** partitioning is determined by placement (and vice-versa)
** no two nodes may share the same token
** when a node leaves the ring, all of its tokens are removed - there is no exchanging of partitions between nodes

One design concern is that replicas of a key range should not be stored on the same physical host, as failure of that host could cause the loss of more than one replica of the data. This will be achieved by using a placement strategy very similar to the existing NetworkTopologyStrategy, which treats each individual host the same way as NTS treats a rack - that is, replicas are not assigned to two hosts on the same rack.

I will shortly create a ticket in JIRA to track discussion of this design.
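To illustrate the token assignment and key routing described above, here is a rough sketch (plain Python, illustration only; the hash function, ring size and class names are stand-ins, not the actual Cassandra partitioner or a proposed patch). Each host is given T random tokens, and a key is routed to the host owning the first token at or after the key's hash, wrapping around the ring:

    # Rough sketch of the proposed scheme: each host owns T random tokens;
    # a partition is the interval (previous_token, token]; a key belongs to
    # the host owning the first token at or after the key's hash.
    import hashlib
    import random
    from bisect import bisect_left

    RING = 2**64

    def key_token(key: bytes) -> int:
        """Stand-in for the partitioner: hash a key onto the token space."""
        return int.from_bytes(hashlib.md5(key).digest()[:8], "big")

    class TokenRing:
        def __init__(self):
            self.tokens = []   # sorted list of all tokens in the ring
            self.owner = {}    # token -> host

        def add_host(self, host, T, rng):
            """A joining host picks T random tokens, each splitting an
            existing partition."""
            for _ in range(T):
                token = rng.randrange(RING)
                self.owner[token] = host
            self.tokens = sorted(self.owner)

        def primary(self, key: bytes):
            """Host owning the partition (previous_token, token] that
            contains the key, wrapping around the ring."""
            i = bisect_left(self.tokens, key_token(key)) % len(self.tokens)
            return self.owner[self.tokens[i]]

    rng = random.Random(1)
    ring = TokenRing()
    for h in range(10):
        ring.add_host(f"host{h}", T=256, rng=rng)

    # With many tokens per host, random assignment balances load with high
    # probability: each of the 10 hosts owns roughly 10% of the keys.
    counts = {}
    for k in range(100_000):
        host = ring.primary(f"key{k}".encode())
        counts[host] = counts.get(host, 0) + 1
    print(sorted(counts.values()))

In this sketch, adding an eleventh host with add_host splits roughly T existing partitions, each contributing only a small slice of data, which is how a bootstrap spreads its streaming load across the whole cluster instead of taking half of one neighbour's range.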
We have also done some simulation of this scheme to observe the load balancing properties, node size distribution, cluster resizing and so on. I will attach some results of this simulation to the JIRA ticket in due course.

We are keen to get the ball rolling on this and we look forward to your input, ideas and recommendations.

Best Regards,

Sam Overton

[1] Staggering repair: https://issues.apache.org/jira/browse/CASSANDRA-3721
[2] Project Voldemort, Design: http://project-voldemort.com/design.php
[3] Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
[4] Ketama: Consistent Hashing: http://www.audioscrobbler.net/development/ketama/
[5] Consistent Hashing: http://www.lexemetech.com/2007/11/consistent-hashing.html

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu