incubator-cassandra-dev mailing list archives

From Jonathan Ellis <jbel...@gmail.com>
Subject Re: RFC: Cassandra Virtual Nodes
Date Tue, 20 Mar 2012 13:39:01 GMT
I like this idea.  It feels like a good 80/20 solution -- 80% of the
benefits, 20% of the effort.  More like 5% of the effort.  I can't
even enumerate all the places full vnode support would change, but an
"active token range" concept would be relatively limited in scope.

Full vnodes feels a lot more like the counters quagmire, where
Digg/Twitter worked on it for... 8? months, and then DataStax worked
on it for about 6 months post-commit, and we're still finding
the occasional bug-since-0.7 there.  With the benefit of hindsight, as
bad as maintaining that patchset was out of tree, committing it as
early as we did was a mistake.  We won't do that again.  (On the
bright side, git makes maintaining such a patchset easier now.)

On Mon, Mar 19, 2012 at 5:16 PM, Rick Branson <rbranson@datastax.com> wrote:
> I think if we could go back and rebuild Cassandra from scratch, vnodes
> would likely be implemented from the beginning. However, I'm concerned that
> implementing them now could be a big distraction from more productive uses
> of all of our time and introduce major potential stability issues into what
> is becoming a business critical piece of infrastructure for many people.
> However, instead of just complaining and pedantry, I'd like to offer a
> feasible alternative:
>
> Has there been consideration given to the idea of supporting a single
> token range for a node?
>
> While not theoretically as capable as vnodes, it seems to me to be more
> practical as it would have a significantly lower impact on the codebase and
> provides a much clearer migration path. It also seems to solve a majority
> of complaints regarding operational issues with Cassandra clusters.
>
> Each node would have a lower and an upper token, which would form a range
> that would be actively distributed via gossip. Read and replication
> requests would only be routed to a replica when the key of these operations
> matched the replica's token range in the gossip tables. Each node would
> locally store its own current active token range as well as a target token
> range it's "moving" towards.
>
> As a new node undergoes bootstrap, the bounds would be gradually expanded
> to allow it to handle requests for a wider range of the keyspace as it
> moves towards its target token range. This idea boils down to a move from
> hard cutovers to smoother operations by gradually adjusting active token
> ranges over a period of time. It would apply to token change operations
> (nodetool 'move' and 'removetoken') as well.
>
> Failure during streaming could be recovered at the bounds instead of
> restarting the whole process as the active bounds would effectively track
> the progress for bootstrap & target token changes. Implicitly these
> operations would be throttled to some degree. Node repair (AES) could also
> be modified using the same overall ideas to provide a more gradual impact on
> the cluster overall, similar to the ideas given in CASSANDRA-3721.
>
> While this doesn't spread the load over the cluster for these operations
> as evenly as vnodes do, this is likely an issue that could be worked
> around by performing concurrent (throttled) bootstrap & node repair (AES)
> operations. It does allow some kind of "active" load balancing, though
> clearly not one as flexible or as useful as vnodes. Then again, you should
> be using RandomPartitioner or sort-of-randomized keys with OPP, right? ;)
>
> As a side note: vnodes fail to provide solutions to node-based limitations
> that seem to me to cause a substantial portion of operational issues such
> as impact of node restarts / upgrades, GC and compaction induced latency. I
> think some progress could be made here by allowing a "pack" of independent
> Cassandra nodes to be run on a single host; somewhat (but nowhere near
> entirely) similar to the pre-fork model used by some UNIX-based servers.
>
> Input?
>
> --
> Rick Branson
> DataStax
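
To make the quoted "active token range" idea concrete, here is a minimal
sketch of how a node's active bounds could gate routing and widen as
bootstrap streaming progresses. It is only an illustration under stated
assumptions: the class ActiveRange, its fields, and the methods accepts(),
advance() and finished() are hypothetical names, not anything in the
Cassandra codebase, and the sketch ignores ring wraparound, gossip, and
replication.

```python
from dataclasses import dataclass


@dataclass
class ActiveRange:
    """Hypothetical per-node state; names are illustrative, not Cassandra's."""
    lower: int         # exclusive lower bound of the active range
    upper: int         # inclusive upper bound of the active range
    target_upper: int  # bound the node is "moving" towards

    def accepts(self, token: int) -> bool:
        # A request is routed here only if its token falls in the active range.
        return self.lower < token <= self.upper

    def advance(self, step: int) -> None:
        # Assumed to be called after a chunk of data has streamed successfully;
        # widens the active range by at most `step` tokens, clamped to the target.
        self.upper = min(self.upper + step, self.target_upper)

    def finished(self) -> bool:
        # The move or bootstrap is complete once active and target bounds meet.
        return self.upper == self.target_upper


# A bootstrapping node starts with an empty active range.
node = ActiveRange(lower=100, upper=100, target_upper=200)
node.advance(60)  # active range is now (100, 160]
node.advance(60)  # clamped to the target: (100, 200]
```

Because the active upper bound only moves after a successful step, a failed
stream could in principle resume from that bound rather than from the start,
which is the recovery property described in the quoted message.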



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
