On 20 March 2012 14:50, Rick Branson <rbranson@datastax.com> wrote:
> To support a form of DF, I think some tweaking of the replica placement could achieve
this effect quite well. We could introduce a variable into replica placement, which I'm going
to incorrectly call DF for the purposes of illustration. The key range for a node would be
subdivided by DF (1 by default) and this would be used to further distribution replica selection
based on this "subpartition".
>
> Currently, the offset formula works out to be something like this:
>
> offset = replica
>
> For RandomPartitioner, DF placement might look something like:
>
> offset = replica + (token % DF)
>
> Now, I realize replica selection is actually much more complicated than this, but these
formulas are for illustration purposes.
>
> Modifying replica placement & the partitioners to support this seems straightforward,
but I'm unsure of what's required to get it working for ring management operations. On the
surface, it does seem like this could be added without any kind of difficult migration support.
>
> Thoughts?
This solution increases the DF, which has the advantage of providing
some balancing when a node is down temporarily. The reads and writes
it would have served are now distributed around ~DF nodes.
However, it doesn't have any distributed rebuild. In fact, any
distribution mechanism with one token per node cannot have distributed
rebuild. Should a node fail, the next node in the ring has twice the
token range so must have twice the data. This node will limit the
rebuild time  'nodetool removetoken' will have to replicate the data
of the failed node onto this node.
Increasing the distribution factor without speeding up rebuild
increases the failure probability  both for data loss or being unable
to reach required consistency levels. The failure probability is a
tradeoff between rebuild time and distribution factor. Lower rebuild
time helps, and lower distribution factor helps.
Cassandra as it is now has the longest rebuild time and lowest
possible distribution factor. The original vnodes scheme is the other
extreme  shortest rebuild time and largest possible distribution
factor.
It turns out that the rebuild time is more important, so this
decreases failure probability (with some assumptions you can show it
decreases by a factor RF!  I'll spare you the math but can send it if
you're interested).
This scheme has the longest rebuild time and a (tuneable) distribution
factor, but larger than the lowest. That necessarily increases the
failure probability over both Cassandra now and vnode schemes, so I'd
be very careful about choosing it.
Richard.
