cassandra-commits mailing list archives

From "Peter Schuller (Commented) (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-3829) make seeds *only* be seeds, not special in gossip
Date Mon, 13 Feb 2012 19:08:59 GMT


Peter Schuller commented on CASSANDRA-3829:

Okay, I'm with you so far. But as you note, this impacts the usability of single-node clusters
which is where virtually everybody starts. So, I'll need to see a solution that doesn't make
life more confusing for that overwhelming majority. I get that you don't like the current
tradeoffs but I haven't seen a better proposal yet. (I'll go ahead and pre-emptively -1 special
environment variables...)

I haven't been able to come up with a solution that avoids the initial setup requiring special
actions. While I am personally fine with this (any software that doesn't would cause me to
wonder "what? what if this wasn't an initial setup?") I understand that 99% of users would
probably not be fond of this behavior and it would just turn people off of Cassandra.

So, what about an opt-in setting which explicitly says the inverse - this *is* a production
cluster that is not being set up? The recommendation could be that everyone uses this setting
after a cluster is in production, but things keep working if they don't (subject to the risks
associated with re-bootstrapping someone on the seed list, a problem we already have).

This could be either a {{cassandra.yaml}} option or, if that is deemed too visible/confusing,
a not-so-prominently-documented environment variable. However, if a documented {{cassandra.yaml}}
option in the default config is not acceptable, I think I'd still prefer a {{cassandra.yaml}}
setting that wasn't in the default configuration to an environment variable.

(This is another case where it doesn't really matter *to me*. We can easily just patch in
the env variable and run with it on our end, it's not like that patch will be a maintenance
problem for us. I really just want to try to make this safer for all users.)

I still haven't seen a case when this, or special-casing seeds to prevent gossip partitions,
causes real problems. Whereas I was around when we added the gossip-partition-prevention code,
so I do know the problems that prevents.

Jumping into clusters/rolling restarts:

So I can give anecdotal stories about seeing people, multiple times, being unaware and/or
confused about a node jumping into a cluster without bootstrapping and not realizing what's
going on, or tell you that a long time ago before I knew enough about gossip I was feeling
the pains of rolling restarts whenever maintenance was done on clusters.

But in this case it seems better to just have it flow from actual facts because it's not really
that subjective. Consider the combination of:

* Restarts are in fact required to change seeds.
* A restart can easily be very very slow due to index sampling (until the samples-on-disk
patch is in), row cache pre-load, commit log replay (not if you drained properly though), etc.
* A restart can also be problematic if it e.g. causes page cache eviction and thus necessitates
rate limiting rolling restarts.
* Completing rolling restarts in a safe manner can be prevented by pre-existing nodes being down
in the cluster, depending on the configuration (e.g., RF=3 QUORUM, one node already down -> can't restart neighbors).
* In addition, all forms of restarts carry with them some risk, even if we were to only consider
the risk involved in terms of adding additional windows of potential double failures.
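
To make the RF=3 QUORUM example concrete, here is a minimal sketch of the arithmetic. The function names are hypothetical and not part of Cassandra; the sketch only counts replicas of a single token range and assumes every request uses QUORUM.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM request."""
    return replication_factor // 2 + 1


def can_restart_replica(replication_factor: int, replicas_already_down: int) -> bool:
    """True if taking one more replica of a range down still leaves a quorum.

    Hypothetical helper for illustration; it considers one token range only.
    """
    live_during_restart = replication_factor - replicas_already_down - 1
    return live_during_restart >= quorum(replication_factor)


# RF=3 with all replicas up: restarting one leaves 2 live, which is a quorum.
print(can_restart_replica(3, 0))
# RF=3 with one replica already down: restarting a neighbor leaves 1 live.
print(can_restart_replica(3, 1))
```

With RF=3 the quorum is 2, so a single pre-existing failure is enough to block the restart of any node that shares a range with the failed one.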

Having to do a full rolling restart on a production cluster, particularly if the cluster has
a lot of data (-> slower restarts, more sensitive to page caches, etc), is a *huge* operation
to do just because you needed to e.g. replace a broken disk in, and rebootstrap, a node that
just happened to be a seed. And clearly, the probability that *some* other node in the cluster
is currently down for whatever reason in a large cluster is non-trivial, and that would make it
impossible to complete a rolling restart.
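
As a rough illustration of why that probability is non-trivial, the sketch below computes the chance that at least one *other* node is down at a given moment. It assumes each node is down independently with the same probability; both that assumption and the 1% figure used in the example are illustrative and not measurements from any real cluster.

```python
def p_some_other_node_down(node_count: int, p_node_down: float) -> float:
    """Probability that at least one of the other nodes is currently down.

    Assumes independent failures with identical per-node probability,
    which is a simplification made for illustration only.
    """
    others = node_count - 1
    return 1.0 - (1.0 - p_node_down) ** others


# Illustrative numbers: each node is down 1% of the time.
for n in (10, 100):
    print(n, round(p_some_other_node_down(n, 0.01), 2))
```

Under these assumptions a 100-node cluster has roughly a 63% chance that some other node is down at any given moment, against under 10% for a 10-node cluster.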

Of course one might again argue that there is no real need to be that strict on maintaining
the seed list, but again the circumstances under which this is safe are very opaque to people
not intimately familiar with the code - and not being strict about it kind of takes away the
protection against partitions it was supposed to give you from the start.

So, while I realize changing the role of seeds is more controversial, I have a hard time understanding
how it cannot be obviously better to allow seeds to be reloadable. Pushing a .yaml configuration
file vs. a *complete rolling restart of the entire cluster* - that's a huge difference in
impact, effort and risk for most production clusters.
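
What "reloadable" could look like, as a minimal sketch: a seed list that is re-read whenever its backing file changes, so that pushing a new file takes effect without a restart. Everything here is an assumption made for illustration: the class name, the one-address-per-line file format (not the real {{cassandra.yaml}} layout), and the idea that the caller asks for the seeds on each gossip round. It is not Cassandra's seed provider API.

```python
import os
from typing import List, Optional


class ReloadableSeedList:
    """Hypothetical seed list that re-reads its file when the file changes."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._mtime_ns: Optional[int] = None  # mtime of the version we loaded
        self._seeds: List[str] = []

    def seeds(self) -> List[str]:
        """Return the current seeds, reloading if the file was modified."""
        mtime_ns = os.stat(self._path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            with open(self._path, encoding="utf-8") as f:
                # One address per line; blank lines are ignored.
                self._seeds = [line.strip() for line in f if line.strip()]
            self._mtime_ns = mtime_ns
        return list(self._seeds)
```

A caller would construct this once at start-up and call `seeds()` whenever it needs the list, instead of caching the result for the lifetime of the process.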

> make seeds *only* be seeds, not special in gossip 
> --------------------------------------------------
>                 Key: CASSANDRA-3829
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>            Priority: Minor
> First, a little bit of "framing" on how seeds work:
> The concept of "seed hosts" makes fundamental sense; you need to
> "seed" a new node with some information required in order to join a
> cluster. Seed hosts are the information Cassandra uses for this
> purpose.
> But seed hosts play a role even after the initial start-up of a new
> node in a ring. Specifically, seed hosts continue to be gossiped to
> separately by the Gossiper throughout the life of a node and the
> cluster.
> Generally, operators must be careful to ensure that all nodes in a
> cluster are appropriately configured to refer to an overlapping set of
> seed hosts. Strictly speaking this should not be necessary (see
> further down though), but is the general recommendation. An
> unfortunate side-effect of this is that whenever you are doing ring
> management, such as replacing nodes, removing nodes, etc, you have to
> keep in mind which nodes are seeds.
> For example, if you bring a new node into the cluster, doing
> everything right with token assignment and auto_bootstrap=true, it
> will just enter the cluster without bootstrap - causing inconsistent
> reads. This is dangerous.
> And worse - changing the notion of which nodes are seeds across a
> cluster requires a *rolling restart*. It can be argued that it should
> actually be okay for nodes other than the one being fiddled with to
> incorrectly treat the fiddled-with node as a seed node, but this fact
> is highly opaque to most users that are not intimately familiar with
> Cassandra internals.
> This adds additional complexity to operations, as it introduces a
> reason why you cannot view the ring as completely homogeneous, despite
> the fundamental idea of Cassandra that all nodes should be equal.
> Now, fast forward a bit to what we are doing over here to avoid this
> problem: We have a zookeeper based system for keeping track of hosts
> in a cluster, which is used by our Cassandra client to discover nodes
> to talk to. This works well.
> In order to avoid the need to manually keep track of seeds, we wanted
> to make seeds be automatically discoverable in order to eliminate them as
> an operational concern. We have implemented a seed provider that does
> this for us, based on the data we keep in zookeeper.
> We could see essentially three ways of plugging this in:
> * (1) We could simply rely on not needing overlapping seeds and grab whatever we have
> when a node starts.
> * (2) We could do something like continually treat all other nodes as seeds by dynamically
> changing the seed list (involves some other changes like having the Gossiper update its notion
> of seeds).
> * (3) We could completely eliminate the use of seeds *except* for the very specific purpose
> of initial start-up of an unbootstrapped node, and keep using a static (for the duration of
> the node's uptime) seed list.
> (3) was attractive because it felt like this was the original intent
> of seeds; that they be used for *seeding*, and not be constantly
> required during cluster operation once nodes are already joined.
> Now before I make the suggestion, let me explain how we are currently
> (though not yet in production) handling seeds and start-up.
> First, we have the following relevant cases to consider during a normal start-up:
> * (a) we are starting up a cluster for the very first time
> * (b) we are starting up a new clean node in order to join it to a pre-existing cluster
> * (c) we are starting up a pre-existing already joined node in a pre-existing cluster
> First, we proceeded on the assumption that we wanted to remove the use
> of seeds during regular gossip (other than on initial startup). This
> means that for the (c) case, we can *completely* ignore seeds. We
> never even have to discover the seed list, or if we do, we don't have
> to use them.
> This leaves (a) and (b). In both cases, the critical invariant we want
> to achieve is that we must have one or more *valid* seeds (valid means
> for (b) that the seed is in the cluster, and for (a) that it is one of
> the nodes that are part of the initial cluster setup).
> In the (c) case the problem is trivial - ignore seeds.
> In the (a) case, the algorithm is:
> * Register with zookeeper as a seed
> * Wait until we see *at least one* seed *other than ourselves* in zookeeper
> * Continue regular start-up process with the seed list (with 1 or more seeds)
> In the (b) case, the algorithm is:
> * Wait until we see *at least one* seed in zookeeper
> * Continue regular start-up process with the seed list (with 1 or more seeds)
> * Once fully up (around the time we listen to thrift), register as a seed in zookeeper
> With the annoyance that you have to explicitly let Cassandra know that
> "I am starting a cluster for the very first time from scratch", and
> ignoring the problem of single node clusters (just to avoid
> complicating this post further), this guarantees in both cases that
> all nodes eventually see each other.
> In the (a) case, all nodes except one are guaranteed to see the "one"
> node. The "one" node is guaranteed to see one of the others. Thus -
> convergence.
> In the (b) case, it's simple - the new node is guaranteed to see one
> or more nodes that are in the cluster - convergence.
> The current status is that we have implemented the seed provider and
> the start-up sequence works. But in order to simplify Cassandra (and
> to avoid having to diverge), we propose that we take this to its
> conclusion and officially make seeds only relevant on start-up, by
> only ever gossiping to seeds when in pre-bootstrap mode during
> start-up.
> The perceived benefits are:
> * Simplicity for the operator. All nodes are equal once joined; you can almost forget
> completely about seeds.
> * No rolling restarts or potential for footshooting a node into a cluster without bootstrap
> because it happened to be a seed.
> * Production clusters will suddenly start to actually *test* the gossip protocol without
> relying on seeds. How sure are we that it even works, and that phi conviction is appropriate
> and RING_DELAY is appropriate, given that practical clusters tend to gossip to a random (among
> very few) seed? This change would make it so that we *always* gossip randomly to anyone in
> the cluster, and there should be no danger that a cluster happens to hold together because
> seeds are up - only to explode when they are not.
> * It eliminates non-trivial concerns with automatic seed discovery, particularly when
> you want that seed discovery to be rack and DC aware. All you care about is what was described
> above; if that seed happens to fail, we simply fail to find the cluster and can abort start-up
> and it can be retried. There is no need for "redundancy" in seeds.
> Thoughts? Are seeds important (by design) in some way other than for seeding? What do
> other people think about the implications of RING_DELAY etc?
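
The three start-up cases (a), (b) and (c) in the quoted description can be sketched as follows. This is an illustration only: `SeedRegistry` is an in-memory stand-in for the ZooKeeper-backed registry, and all names are hypothetical rather than taken from the actual seed provider. Returning `None` stands for "keep waiting and ask again".

```python
from enum import Enum
from typing import List, Optional, Set


class StartupCase(Enum):
    NEW_CLUSTER = "a"    # starting a cluster for the very first time
    NEW_NODE = "b"       # clean node joining a pre-existing cluster
    EXISTING_NODE = "c"  # already joined node restarting


class SeedRegistry:
    """In-memory stand-in for the ZooKeeper-backed seed registry."""

    def __init__(self) -> None:
        self._seeds: Set[str] = set()

    def register(self, node: str) -> None:
        self._seeds.add(node)

    def seeds(self) -> Set[str]:
        return set(self._seeds)


def seeds_for_startup(case: StartupCase, me: str,
                      registry: SeedRegistry) -> Optional[List[str]]:
    """Return the seeds to start with, or None if the node must keep waiting."""
    if case is StartupCase.EXISTING_NODE:
        return []  # (c): seeds are ignored entirely
    if case is StartupCase.NEW_CLUSTER:
        registry.register(me)                # (a): register as a seed first
        candidates = registry.seeds() - {me}  # need a seed other than ourselves
    else:
        candidates = registry.seeds()        # (b): any registered seed will do
    return sorted(candidates) if candidates else None


def on_fully_up(case: StartupCase, me: str, registry: SeedRegistry) -> None:
    """(b): register as a seed only once the node is fully up."""
    if case is StartupCase.NEW_NODE:
        registry.register(me)
```

In case (a) the first node to register gets `None` until a second node registers, which matches the convergence argument: every node except the first sees the first, and the first sees one of the others.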

This message is automatically generated by JIRA.

