> a) a virtual node partitioning scheme (to support heterogeneity and
> management simplicity)
> b) topology aware replication
> c) topology aware routing
I would add (d) limiting the distribution factor to decrease the
probability of data loss/multiple failures within a replica set.
> First of all, I think that while (c) depends on (b) it does not affect
> partitioning or replication directly, so I'm going to set that aside
> for the moment and talk just about the former two.
Agreed (but I think (d) relates).
> 1. The hashspace is partitioned into a fixed number of partitions
> 2. The CRUSH algorithm is run - `select(1, disk)` - over the topology
> using each partition as a key, to get an assignment of partition ->
> physical host (primary)
> 2a. adding or removing a node requires rerunning CRUSH to recalculate
> the partition assignment (and move data)
(or any other arbitrary change, yes)
> 3. The CRUSH algorithm is run - `select(RF-1, disk)` - over the topology
> using each primary host id, to get an assignment of primary host ->
> RF-1 replicas
> 3a. adding or removing a node requires rerunning CRUSH to recalculate
> replica assignment (which might be a different set of hosts to
> before?)
Yes. Cassandra would have a minimum of two topologies; the "current"
and the "next" topology. Each would imply a mapping of partition ->
replica set, and that mapping will potentially be different between
the two.
Reads would always be served from the "current" topology. Writes would
go to the union of the current and the next topology, taking care to
"tie" replicas together correctly for consistency level purposes (this
is what CASSANDRA-3901 and CASSANDRA-3833 are talking about).
Any topology change is treated the same from the read/write path
perspective, regardless of whether you're adding a node, removing a
node, adding an entire rack, or even an entire data center. No added
complexity is introduced beyond the "base case".
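To make the "union of current and next" concrete, here is a minimal sketch (my own illustration, not Cassandra's actual code; the topology dicts and host names are invented):

```python
# During a topology change, reads use the "current" mapping, while writes
# go to the union of the current and next replica sets for a partition.

def replicas(topology, partition):
    """Hypothetical lookup: topology maps partition -> ordered replica list."""
    return topology[partition]

def write_targets(current, nxt, partition):
    # Union, preserving order and de-duplicating hosts present in both.
    seen, targets = set(), []
    for host in replicas(current, partition) + replicas(nxt, partition):
        if host not in seen:
            seen.add(host)
            targets.append(host)
    return targets

current = {0: ["a", "b", "c"]}
nxt     = {0: ["a", "b", "d"]}   # host d is taking over from c
print(write_targets(current, nxt, 0))  # ['a', 'b', 'c', 'd']
```

The "tying together" of replicas for consistency-level accounting is the part this sketch glosses over; it only shows which hosts receive the write.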
> One of my concerns about using CRUSH exactly as described in the paper
> is that it seems to be suboptimal in the amount of data that it moves
> after modifying the topology. The authors of the paper introduce
> several "bucket types" (uniform, list, tree, straw) which appear to be
> various suboptimal alternatives to consistent hashing, with various
> tradeoffs. Why not use consistent hashing? Given (2a) and (3a) I
> think we might end up moving way too much data when the set of
> replicas changes completely for a given host.
One of the benefits of pre-partitioning to a fixed set of partitions
is that we can precalculate the mapping. This removes the CPU
efficiency tradeoff of the straw bucket, so the straw bucket would
be a good choice.
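Roughly what I mean (a simplified sketch; the real CRUSH straw bucket scales draws with a more careful function of the weights, and all names here are made up):

```python
import hashlib

# Each item "draws a straw" whose length depends on (key, item) and is
# scaled by the item's weight; the longest straw wins. This is the shape
# of the straw bucket, not CRUSH's exact weight scaling.
def straw(key, item, weight):
    h = int.from_bytes(hashlib.md5(f"{key}:{item}".encode()).digest()[:8], "big")
    return (h / 2.0**64) * weight  # longer straws for heavier items

def straw_select(key, items):
    """items: item -> weight; returns the item drawing the longest straw."""
    return max(items, key=lambda i: straw(key, i, items[i]))

# With a fixed set of partitions, the whole mapping is precalculated once
# per topology change, instead of evaluating every straw per row key.
items = {"hostA": 1.0, "hostB": 1.0, "hostC": 2.0}
mapping = {p: straw_select(p, items) for p in range(1024)}
```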
Consistent hashing: It's totally doable to use consistent hashing at
each node in the topology. It is not without its own tradeoffs
though, because the granularity of weighting you want to support, and
the accuracy of it, relate directly to the number of vnodes per child
you need to keep in your consistent hashing ring. Taking granularity,
accuracy target and number of children into account can easily lead to
very large numbers of vnodes.
(At least experimentally from when I've implemented and played with
the simple form of consistent hashing in the past. I don't currently
have good mathematical evidence.)
> Let's suppose we introduce our own bucket type called a "ring bucket".
> Each item in a ring bucket is assigned an equal, non-contiguous
> portion of the key hashspace, which determines which keys are
> assigned to it. When an item is added to the ring bucket, it takes an
> equal portion of the hashspace from every other item already in the
> bucket. And vice versa for removals. It's easy to see that this ring
> bucket implements consistent hashing with some unspecified virtual
> node scheme. Additions and removals would be optimal (only Δw/W
> keys require moving when the topology changes).
This is my understanding of what you meant by consistent hashing, and
what I refer to above.
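For concreteness, a toy version of that ring bucket, using a fixed partition count to stand in for the hashspace (the stealing rule is my own invention, purely to show the Δw/W movement property):

```python
# Adding an item steals an equal share from every existing item, so only
# about 1/(n+1) of the partitions move and ownership stays balanced.
def add_item(assignment, new_item):
    """assignment: partition -> owning item; mutated in place."""
    items = sorted(set(assignment.values()))
    n = len(items) + 1
    for item in items:
        owned = [p for p, o in sorted(assignment.items()) if o == item]
        for p in owned[::n]:  # hand every n-th owned partition to the newcomer
            assignment[p] = new_item
    return assignment

parts = 12
assignment = {p: "a" if p % 2 == 0 else "b" for p in range(parts)}
before = dict(assignment)
add_item(assignment, "c")
moved = sum(1 for p in range(parts) if assignment[p] != before[p])
print(moved, parts)  # prints: 4 12 -- a third of the partitions moved
```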
> Using this ring bucket in the CRUSH topology, (with the hash function
> being the identity function) would give the exact same distribution
> properties as the virtual node strategy that I suggested previously,
> but of course with much better topology awareness.
I will have to re-read your original post. I seem to have missed something :)
> This makes it evident that the partitioning scheme and a CRUSH-like
> replication scheme are orthogonal concerns. In the same way as NTS
> currently uses the ring to provide distribution at DC and rack level
> by conceptually separating the ring into distinct logical rings for
> each DC, a CrushReplicationStrategy could use the ring as its
> bucketing function to distribute partitions in the topology.
Yes, agreed. The distribution factor limiting is also compatible with
non-CRUSH schemes, by hash chaining from the primary replica instead
of the row key.
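A toy sketch of what I mean by hash chaining (names and details invented for illustration): replica i+1 is derived from replica i's identity rather than from the row key, so every partition with the same primary gets the same replica set, bounding DF.

```python
import hashlib

def chain_replicas(primary, hosts, rf):
    """Derive rf replicas deterministically by chaining hashes of host
    identities, starting from the primary. Row keys never enter into it."""
    replicas = [primary]
    seed, attempt = primary, 0
    while len(replicas) < rf:
        h = int(hashlib.md5(f"{seed}:{attempt}".encode()).hexdigest(), 16)
        attempt += 1  # salt retries so collisions with chosen hosts terminate
        candidate = hosts[h % len(hosts)]
        if candidate not in replicas:
            replicas.append(candidate)
            seed, attempt = candidate, 0
    return replicas

hosts = [f"h{i}" for i in range(10)]
print(chain_replicas("h0", hosts, 3))  # same replica set every time for h0
```

A real scheme would of course also need to be topology aware; this only shows the bounded-DF property.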
> This brings me on to (1) and the reasons for our choice of virtual
> node scheme - choose N random tokens - instead of the Dynamo-like
> scheme that you suggest where the partitions are fixed in advance.
> With the Dynamo scheme, the size of a virtual node partition will only
> ever grow as more data is inserted. Since the number of partitions is
> fixed when the cluster is created, the partition size is unbounded.
I'm not sure I see why my suggestion is Dynamo-like. In what way? It
is essentially random in the sense of converging toward uniformity,
but with the difference that the mapping is produced deterministically
(and in a stable fashion).
> There are certain advantages to having a limit on partition size.
> Streaming failures that cause retries do not have to resend so much
> data. Streaming operations can be staggered in smaller chunks to
> minimise the impact on the nodes involved. Load balancing can operate
> on a finer granularity.
In my scheme the limit would be implicit in the number of partitions
(combined with cluster size). And yes, this is a very good property
IMO. Relatedly, as mentioned, you can make sure to structure
the data in terms of these partitions. In terms of current Cassandra,
this means that "cleanup" is an instantaneous operation instead of an
expensive operation that has to truck through a lot of data (less so
with leveled compaction).
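The "instantaneous cleanup" point, sketched (my own illustration, assuming on-disk data is organized per fixed partition):

```python
def cleanup(local_data, owned_partitions):
    """local_data: partition -> rows. After a topology change, drop whole
    partitions we no longer own -- no per-row scan required."""
    dropped = set(local_data) - set(owned_partitions)
    for p in dropped:
        del local_data[p]  # O(number of dropped partitions)
    return dropped

data = {0: ["r1"], 1: ["r2"], 2: ["r3"]}
print(cleanup(data, {0, 2}))  # partition 1 is dropped wholesale
```

Contrast with current Cassandra, where cleanup must read through the node's data to find rows that fall outside its new ranges.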
> The other concern you mentioned was
>> The probability of data loss increases linearly with cluster size.
>
> but you also acknowledge that
>
>> In making this determination, one must take into account that if a
>> larger `DF` makes reconstruction/replacement significantly faster,
>> that also decreases the time window in which multiple failures can
>> occur. Increasing `DF` is thus not *necessarily* increasing the total
>> probability of data loss (for small values of `DF`).
>
> Our calculations lead us to believe that in fact the shorter rebuild
> window more than compensates for the increased probability of multiple
> failure, so with DF=N the probability of data loss is minimised.
>
> The CRUSH paper also states:
>
> "With 2-way mirroring these two factors cancel
> each other out, while overall data safety with more than two
> replicas increases with declustering [Xin et al. 2004]"
>
> ("declustering" meaning increasing DF towards N)
This is a common argument against limiting RDF, but I am *strongly*
skeptical of it in real-life scenarios. This is one of those cases
where I think one needs to look beyond the pure math. For one thing, a
huge concern for me is that I don't want active rebalancing to have
such an extreme availability requirement that downtime in that active
rebalancing implies a significantly increased risk of data loss. I
don't want a system to be constantly teetering on the edge of data
loss, and it not even being safe to *shut it down* because your lack
of data loss is dependent on the system as a whole being fully
available, working and actively moving data around. It also hinges on
reality matching theory perfectly in terms of "independent"
failures truly being independent. I would feel much more comfortable
with a system that did not rely on super-fast re-replication to ensure
data safety. But like I said, a lot of people make this argument - I
just remain unconvinced thus far.
Further, even looking at just the math, the claim cannot possibly hold
as N grows sufficiently large. At some point you will bottleneck on
the network and no longer benefit from a higher RDF, but the
probability of data loss doesn't drop off until you reach DF = number of
partitions (because beyond that point an increased cluster size no
longer increases the number of nodes sharing data with any given node).
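A back-of-the-envelope version of that argument (entirely my own toy model; the constant 100 standing in for unconstrained rebuild time is arbitrary): the chance that a second failure lands on one of the first node's DF partners grows with DF, while the rebuild window shrinks as 1/DF, until the network/node bottleneck caps rebuild speed and only the exposure keeps growing.

```python
def relative_loss_risk(df, n, min_window=1.0):
    """Toy model: risk ~ (fraction of nodes sharing data with the failed
    node) * (rebuild window, floored once the network is the bottleneck)."""
    exposure = df / (n - 1)               # fraction of other nodes sharing data
    window = max(100.0 / df, min_window)  # rebuild time, bottleneck-floored
    return exposure * window

n = 1000
for df in (5, 100, 500, 999):
    print(df, round(relative_loss_risk(df, n), 4))
```

In this model the two factors cancel exactly while rebuild speed scales with DF, and risk grows linearly with DF once it no longer does, which is the qualitative point above.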

/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
