incubator-cassandra-dev mailing list archives

From Eric Evans <>
Subject Re: RFC: Cassandra Virtual Nodes
Date Tue, 20 Mar 2012 14:08:57 GMT
On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis <> wrote:
> I like this idea.  It feels like a good 80/20 solution -- 80% of the
> benefits, 20% of the effort.  More like 5% of the effort.  I can't
> even enumerate all the places full vnode support would change, but an
> "active token range" concept would be relatively limited in scope.

It only addresses 1 of Sam's original 5 points, so I wouldn't call it
an "80% solution".

> Full vnodes feels a lot more like the counters quagmire, where
> Digg/Twitter worked on it for... 8? months, and then DataStax worked
> on it for about 6 months post-commit, and we're still finding
> the occasional bug-since-0.7 there.  With the benefit of hindsight, as
> bad as maintaining that patchset was out of tree, committing it as
> early as we did was a mistake.  We won't do that again.  (On the
> bright side, git makes maintaining such a patchset easier now.)

And yet counters have become a very important feature for Cassandra;
we're better off with them than without.

I think there were a number of problems with how counters went down
that could be avoided here.  For one, we can take a phased,
incremental approach, rather than waiting 8 months to drop a large

> On Mon, Mar 19, 2012 at 5:16 PM, Rick Branson <> wrote:
>> I think if we could go back and rebuild Cassandra from scratch, vnodes
>> would likely be implemented from the beginning. However, I'm concerned that
>> implementing them now could be a big distraction from more productive uses
>> of all of our time and introduce major potential stability issues into what
>> is becoming a business critical piece of infrastructure for many people.
>> However, instead of just complaining and pedantry, I'd like to offer a
>> feasible alternative:
>> Has there been consideration given to the idea of supporting a single
>> token range for a node?
>> While not theoretically as capable as vnodes, it seems to me to be more
>> practical as it would have a significantly lower impact on the codebase and
>> provides a much clearer migration path. It also seems to solve a majority
>> of complaints regarding operational issues with Cassandra clusters.
>> Each node would have a lower and an upper token, which would form a range
>> that would be actively distributed via gossip. Read and replication
>> requests would only be routed to a replica when the key of these operations
>> matched the replica's token range in the gossip tables. Each node would
>> locally store its own current active token range as well as a target token
>> range it's "moving" towards.
>> As a new node undergoes bootstrap, the bounds would be gradually expanded
>> to allow it to handle requests for a wider range of the keyspace as it
>> moves towards its target token range. This idea boils down to a move from
>> hard cutovers to smoother operations by gradually adjusting active token
>> ranges over a period of time. It would apply to token change operations
>> (nodetool 'move' and 'removetoken') as well.
>> Failure during streaming could be recovered at the bounds instead of
>> restarting the whole process as the active bounds would effectively track
>> the progress for bootstrap & target token changes. Implicitly these
>> operations would be throttled to some degree. Node repair (AES) could also
>> be modified using the same overall ideas to provide a more gradual impact on
>> the cluster overall, similar to the ideas given in CASSANDRA-3721.
>> While this doesn't spread the load over the cluster for these operations
>> evenly like vnodes do, this is likely an issue that could be worked
>> around by performing concurrent (throttled) bootstrap & node repair (AES)
>> operations. It does allow some kind of "active" load balancing, but clearly
>> this is not as flexible or as useful as vnodes, but you should be using
>> RandomPartitioner or sort-of-randomized keys with OPP right? ;)
>> As a side note: vnodes fail to provide solutions to node-based limitations
>> that seem to me to cause a substantial portion of operational issues such
>> as impact of node restarts / upgrades, GC and compaction induced latency. I
>> think some progress could be made here by allowing a "pack" of independent
>> Cassandra nodes to be run on a single host; somewhat (but nowhere near
>> entirely) similar to a pre-fork model used by some UNIX-based servers.
>> Input?
>> --
>> Rick Branson
>> DataStax
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
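
For anyone following along, here is a minimal sketch of the "active token
range" idea as Rick describes it above: each node advertises an active
range, requests are routed only to nodes whose active range covers the
key's token, and a bootstrapping node widens its active range in steps
until it reaches its target. This is only an illustration, not Cassandra
code. The names (Node, owns, route, expand) are hypothetical, ranges are
assumed not to wrap around the ring, and the matching shrink of the old
owner's range is not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (lower, upper]: lower bound exclusive, upper bound inclusive.
# Assumption for this sketch: ranges never wrap around the ring.
Range = Tuple[int, int]


@dataclass
class Node:
    name: str
    active: Range  # range currently advertised (e.g. via gossip) and served
    target: Range  # range the node is "moving" towards


def owns(node: Node, token: int) -> bool:
    """True if the token falls inside the node's active range."""
    lower, upper = node.active
    return lower < token <= upper


def route(token: int, nodes: List[Node]) -> List[str]:
    """Names of the nodes whose active range currently covers the token."""
    return [n.name for n in nodes if owns(n, token)]


def expand(node: Node, step: int) -> bool:
    """Widen the active upper bound by at most `step` tokens.

    Returns True once the active range equals the target range.
    """
    lower, upper = node.active
    target_upper = node.target[1]
    node.active = (lower, min(upper + step, target_upper))
    return node.active == node.target


# A bootstrapping node starts with an empty active range (lower == upper).
old = Node("old", active=(0, 200), target=(0, 100))
new = Node("new", active=(100, 100), target=(100, 200))

before = route(150, [old, new])   # only the old node serves token 150
done_first = expand(new, 60)      # active range becomes (100, 160]
after = route(150, [old, new])    # both nodes now cover token 150
done_second = expand(new, 60)     # clamped to the target (100, 200]
```

Because the active bounds record how far the node has got, a failed
stream could in principle resume from the current bound rather than
restarting, which is the recovery property the proposal describes.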

Eric Evans
Acunu | @acunu
