cassandra-commits mailing list archives

From "Chris Goffinet (Commented) (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-3483) Support bringing up a new datacenter to existing cluster without repair
Date Fri, 11 Nov 2011 05:44:51 GMT


Chris Goffinet commented on CASSANDRA-3483:

Some discussion from irc:

23:43 < goffinet> has datastax ever had a customer add a new datacenter to an existing
cluster? No docs or info on web suggest anyone has done this before
23:44 < driftx> yeah
23:44 < goffinet> how is it done? we are running a case where if i modify strategy options
before adding nodes, writes will fail since no endpoints for DC have been added
23:44 < goffinet> we were expecting this might work because we want to bootstrap the
new DC to the existing cluster
23:44 < goffinet> take on writes + stream data with RF factor
23:45 < driftx> general best practice is (jbellis can correct if I'm outdated) add the
dc at rf:0, add the nodes/update snitch, repair
23:45 < driftx> err, update rf, repair
23:46 < goffinet> yeah mind if i open up a jira? that seems extreme to make the cluster
do that .. ?
23:46 < goffinet> or is repair smart enough to just stream ranges instead of AES?
23:46 < driftx> 'instead of AES?' that's what repair is, but if just streams ranges
23:46 < driftx> s/if/it/
23:47 < goffinet> right but AES builds merkle tree, scans through all data ?
23:47 < goffinet> isn't bootstrap a different operation?
23:47 < goffinet> when streaming just sstables
23:47 < driftx> yeah, it is
23:47 < goffinet> yeah thats more heavy. dont understand why we couldnt use that instead
23:47 < goffinet> like bootstrap
23:48 < stuhood> now that i think about it, it doesn't really make sense that a CL.ONE
write fails if a DC isn't available
23:48 < stuhood> independent of the bootstrap case, that sounds like the real issue
23:49 < stuhood> goffinet: ^
23:50 < driftx> hmm, yeah that doesn't
23:50 < driftx> but the problem with bootstrapping a dc is the first node you bootstrap
gets everything
23:50 < goffinet> stuhood: yeah. it was complaining about not enough endpoints 
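The "not enough endpoints" failure being discussed can be illustrated with a toy model of the write path (class and method names below are invented for illustration, not Cassandra's actual code): a naive availability check that requires every datacenter named in the strategy options to have live endpoints will reject even a CL.ONE write when dc2 has rf > 0 but no nodes yet, while the dc2:0 workaround sidesteps it.

```java
import java.util.*;

// Toy model of replica availability: a write at CL.ONE should only need one
// live replica anywhere, but a naive per-DC endpoint check rejects it when
// any DC in the strategy options has rf > 0 and no nodes. Names are invented.
public class EmptyDcWriteCheck {
    // strategyOptions: dc name -> replication factor
    // endpointsPerDc:  dc name -> number of live nodes in that dc
    static boolean writeAccepted(Map<String, Integer> strategyOptions,
                                 Map<String, Integer> endpointsPerDc,
                                 int consistencyRequired) {
        int totalLive = 0;
        for (Map.Entry<String, Integer> e : strategyOptions.entrySet()) {
            int live = endpointsPerDc.getOrDefault(e.getKey(), 0);
            // The problematic check: a DC with rf > 0 but zero endpoints
            // fails the write up front, regardless of consistency level.
            if (e.getValue() > 0 && live == 0)
                return false;
            totalLive += live;
        }
        return totalLive >= consistencyRequired;
    }

    public static void main(String[] args) {
        Map<String, Integer> opts = new HashMap<>();
        opts.put("dc1", 3);
        opts.put("dc2", 3); // dc2 declared before any dc2 node exists
        Map<String, Integer> live = new HashMap<>();
        live.put("dc1", 3);
        // CL.ONE write rejected even though dc1 has three live replicas:
        System.out.println(writeAccepted(opts, live, 1)); // prints false
        opts.put("dc2", 0); // the rf:0 workaround
        System.out.println(writeAccepted(opts, live, 1)); // prints true
    }
}
```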
23:50 < goffinet> driftx: why is that? if you are doubling the cluster, and assign the
tokens manually ?
23:51 < driftx> still have to do them 2 mins apart, and they're probably going to be
part of the same replica set which I think is troublesome too
23:51 < goffinet> driftx: maybe we can make repair a bit more intelligent? if no data
exists on the node .. just stream the ranges instead of using AES
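The suggestion reduces to a simple decision: an empty node gains nothing from building merkle trees, since every range will mismatch anyway, so the whole range can be streamed directly. A minimal sketch of that decision (names and the bare SSTable-count check are invented, not Cassandra's repair implementation):

```java
// Toy decision logic: when the receiving node holds no data at all, a
// merkle-tree comparison (AES) is pointless, so stream the full range.
// Illustrative sketch only, not Cassandra's actual repair path.
public class RepairPlanner {
    enum Plan { MERKLE_TREE_COMPARE, STREAM_FULL_RANGE }

    static Plan planFor(long localSstableCount) {
        // An empty node can skip validation compaction entirely.
        return localSstableCount == 0 ? Plan.STREAM_FULL_RANGE
                                      : Plan.MERKLE_TREE_COMPARE;
    }

    public static void main(String[] args) {
        System.out.println(planFor(0));  // STREAM_FULL_RANGE
        System.out.println(planFor(42)); // MERKLE_TREE_COMPARE
    }
}
```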
23:52 < driftx> problem is we're pushing AES to do the entire replica set (which it
nearly does now)
23:52 < stuhood> goffinet: it shouldn't be as heavyweight as you're thinking
23:53 < goffinet> stuhood: but we have a way currently that is less heavy
23:53 < goffinet> i dont understand why we couldnt use that method
23:53 < stuhood> not implemented =)
23:53 < goffinet> don't cut corners :)
23:53 < stuhood> human time vs cpu time =P
23:54 < driftx> you could almost do something like #3452 and then have a jmx call to
say 'ok, finish'
23:54 < CassBotJr> #3452: Create an 'infinite bootstrap' mode for sampling live traffic
23:54 < driftx> except the first one that tries is going to have every node pound it
with all the writes
23:54 < goffinet> driftx: ill make a jira ticket so we can discuss there, it doesn't
seem like it would be too much trouble to support this use case
23:54 < goffinet> we'd be happy to write the patch after some input
23:55 < driftx> trickier than it sounds I'll bet, but sgtm
23:57 < stuhood> alternatively, is now the right time to add back group bootstrap?
23:58 < stuhood> so you'd 1) add the dc to the strategy, 2) do a group bootstrap of
the entire dc
23:58 < stuhood> would also have to fix the CL.ONE problem though.
23:59 < goffinet> how did group bootstrap work again?
23:59 < driftx> #2434 is relevant
23:59 < CassBotJr> #2434: range movements can violate consistency
--- Day changed Fri Nov 11 2011
00:00 < stuhood> goffinet: bootstrapping many nodes at once without the 2 minute wait
00:01 < goffinet> why was it removed?
00:01 < stuhood> used zookeeper
00:01 < goffinet> oh.
00:01 < stuhood> but come to think of it, removing the 2 minute wait would seem to be
relatively easy
00:02 < goffinet> stuhood, i thought the 2 minute wait was just waiting for ring state
to settle?
00:02 < goffinet> before it streamed from nodes
00:02 < stuhood> goffinet: yea: you could form a "group" bootstrap by inverting things
and waiting until you -hadn't- seen a new node in 2-10 minutes before you chose a token and
started bootstrapping
00:03 < stuhood> so, not terribly simple, but.
00:04 < stuhood> you'd basically have a bunch of nodes sitting around waiting until
no new nodes started, and then they have to deterministically choose tokens.
00:05 < goffinet> yes
00:05 < stuhood> well, alternatively, you wouldn't need a new way to deterministically
choose tokens
00:05 < stuhood> (easier)
00:05 < stuhood> no… scratch that. you would need a way
00:05 < stuhood> for this DC case, all of the nodes are entering an empty ring
00:06 < stuhood> so the group would need to choose something balanced
00:06 < goffinet> empty ring?
00:06 < stuhood> yea, essentially… there are no tokens in that dc
00:06 < goffinet> but we were going to provide the tokens manually?
00:06 < goffinet> were you thinking of making it automatic?
00:07 < stuhood> yea. fixing bootstrapping groups of nodes would make automatic safe
00:08 < stuhood> so… whatever state a node is in when it is sitting and waiting for
enough information to choose a token, it should just stay that way and watch what other nodes
enter that state
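The "deterministically choose tokens" step stuhood describes could work roughly like this sketch: once the joining set has settled, every node sorts the group identically and takes an evenly spaced token from its own position, so all nodes agree without extra coordination. The token space matches RandomPartitioner's 0..2^127 range, but the scheme itself is an invented illustration, not an implemented Cassandra feature.

```java
import java.math.BigInteger;
import java.util.*;

// Sketch of "the group would need to choose something balanced": each node
// derives its token from its position in an identically sorted list of the
// joining nodes, yielding evenly spaced tokens with no coordinator.
public class BalancedTokens {
    static final BigInteger RING = BigInteger.TWO.pow(127); // RandomPartitioner range

    static BigInteger tokenFor(List<String> joiningNodes, String self) {
        List<String> sorted = new ArrayList<>(joiningNodes);
        Collections.sort(sorted); // identical order on every node
        int i = sorted.indexOf(self);
        if (i < 0) throw new IllegalArgumentException(self + " is not joining");
        // Evenly spaced tokens: i * 2^127 / n
        return RING.multiply(BigInteger.valueOf(i))
                   .divide(BigInteger.valueOf(sorted.size()));
    }

    public static void main(String[] args) {
        List<String> dc2 = List.of("10.0.2.1", "10.0.2.2", "10.0.2.3", "10.0.2.4");
        for (String node : dc2)
            System.out.println(node + " -> " + tokenFor(dc2, node));
    }
}
```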
00:08 < goffinet> so i have a question about the 120 second window you have to wait..
00:09 < stuhood> mm
00:09 < driftx> hmm, what if they started up at rf:0 but stayed in some dead state (hibernate
might work) without doing anything until you changed the rf, then actually bootstrapped?
00:09 < goffinet> so imagine i startup all the nodes in DC2 at same time, does join_ring=false
not grab gossip info at all? I was thinking it would be good if we could just start gossip
on all nodes, but until operator says 'go' then i could bootstrap them all at same time
00:09 < goffinet> since i would only have to wait at most 120 seconds before kicking
them all off
00:10 < stuhood> driftx: yea, that could work too… but you'd still need to choose
tokens. (also, the rf=0 thing shouldn't be necessary, right? that's the CL.ONE bug)
00:11 < driftx> well, you really want to choose tokens anyway
00:11 < stuhood> goffinet: it does get gossip… i think that's basically equivalent
to the pre-join state
00:11 < driftx> I guess you don't need rf=0 if all the nodes are in hibernate
00:12 < goffinet> yeah i think you do need hibernate in this case, because if i set
tokens upfront, i want all nodes to know about ATL ones too
00:12 < goffinet> before i kick off bootstrap
00:12 < stuhood> driftx: i'm confused… what is the difference between rf=0 and not
being there?
00:12 < stuhood> is that a workaround for the CL.ONE bug?
00:13 < driftx> you know there's a dc with rf:0, can add one with impacting anything
00:13 < driftx> err, without
00:14 < stuhood> so what was the point of adding it? that's why i'm confused...
00:14 < goffinet> im fine with rf:0, its so you can add the nodes to the cluster before
calling repair
00:14 < goffinet> before you add nodes
00:15 < driftx> because the dc is in the schema
00:15 < driftx> so you need it there to have nodes be in it
00:15 < stuhood> ah
00:16 < goffinet> driftx: any reason why we couldnt just fix that? so dc2:3 wont throw
an error if nodes are down?
00:16 < goffinet> that way you would needed to do two steps
00:16 < goffinet> dc2:0, add nodes, dc2:3
00:16 < goffinet> wouldn't*
00:16 < driftx> I don't understand, you can already do that
00:17 < driftx> you just have to repair afterwards
00:17 < goffinet> it throws an error currently? if you set dc2:3 and no nodes exist
for dc2
00:17 < goffinet> we'll double check on that
00:18 < goffinet> for writes
00:18 < driftx> oh, it does
00:19 < driftx> but only for writes
00:19 < goffinet> yeah
00:19 < goffinet> so thats fine, thats fixable
00:19 < goffinet> im just curious about a) how can we bootstrap nodes without 120s delays
between N nodes b) stream from DC1 without AES
00:21 < stuhood> goffinet: if you figure out a, i don't think b is necessary?
00:22 < stuhood> assuming they are aware of the other joining nodes, and can all join
the same range
00:22 < stuhood> that would be the keystone for some kind of group bootstrap
00:23 < goffinet> let me test out join_ring, because im curious. if join_ring=false
still gossips but doesnt offically join.. it would be nice if node 2 in DC2 knew about that
node too somehow?
00:23 < driftx> that's why I proposed cheating, add them all as non-members, then ask
them to bootstrap
00:23 < goffinet> because then .. i could just run a command on each node at same time
00:23 < goffinet> since they all know about each other in a hibernate state
00:23 < goffinet> driftx: yes i like that
00:24 < driftx>     private void joinTokenRing(int delay) throws IOException, org.apache.cassandra.config.ConfigurationException
00:24 < driftx>     {
00:24 < driftx>         logger_.info("Starting up server gossip");
00:24 < driftx> they don't use gossip with join_ring off
00:24 < stuhood> but will that actually allow them to all join the same range?
00:24 < goffinet> okay cool, yeah we would need to make it join in that special state
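The hibernate scheme being converged on can be sketched as a small state machine: the node joins gossip with a fixed token (so every DC2 node learns the full ring, other hibernating nodes included) but takes no traffic and streams nothing until an operator signal, which in practice would be a JMX operation. All states and names below are invented for illustration.

```java
// Sketch of the hibernate idea: gossip runs, nothing else happens until the
// operator says 'go', at which point all DC2 nodes can bootstrap at once
// since tokens were fixed up front. Names are invented, not Cassandra code.
public class HibernatingNode {
    enum State { HIBERNATE, BOOTSTRAPPING, NORMAL }

    private State state = State.HIBERNATE;
    private final String token; // chosen by the operator before startup

    HibernatingNode(String token) { this.token = token; }

    // Gossip is active even in HIBERNATE, so ring state settles before
    // any data moves.
    boolean gossipActive() { return true; }
    boolean acceptsTraffic() { return state == State.NORMAL; }
    String token() { return token; }
    State state() { return state; }

    // Operator-triggered (a JMX call in practice): safe to invoke on every
    // DC2 node at the same time.
    void startBootstrap() {
        if (state != State.HIBERNATE)
            throw new IllegalStateException("already " + state);
        state = State.BOOTSTRAPPING; // stream ranges for `token` here
    }

    void finishBootstrap() { state = State.NORMAL; }

    public static void main(String[] args) {
        HibernatingNode n = new HibernatingNode("0");
        System.out.println(n.state());          // HIBERNATE
        n.startBootstrap();
        n.finishBootstrap();
        System.out.println(n.acceptsTraffic()); // true
    }
}
```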
00:25 < stuhood> i think there is an edgecase here… if multiple nodes are joining
the same range, and one of them fails, then should they all fail?
00:25 < driftx> no, it basically saves you server startup time that is not ring-related
00:25 < goffinet> stuhood, they all know the tokens ahead of time?
00:25 < goffinet> they just need to know the current global state of things
00:25 < stuhood> goffinet: right, but if they are streaming the range that they will
be responsible for...
00:26 < stuhood> Joining nodes don't stick around if they fail
00:26 < goffinet> they shouldnt be allowed to do that until they joined ?
00:26 < stuhood> nah, you stream while you are joining… unless you are talking about
00:26 < goffinet> stuhood: was that removed? i thought u had to still remove the node
00:26 < goffinet> using the new options in 1.0
00:26 < stuhood> don't know about 1.0
00:27 < driftx> no, a failed non-member is just a fat client and disappears
00:27 < goffinet> but i thought there was a timeout for fat client ?
00:27 < goffinet> is it 30s or something?
00:27 < driftx> yes
00:28 < goffinet> so nodes that arent fat clients, why might we remove them ? if we
00:28 < goffinet> and let the operator do it
00:28 < goffinet> or have a larger timeout
00:28 < goffinet> might make this a non-issue
00:28 < driftx> what does a larger timeout/keeping them around buy you?
00:29 < goffinet> because if they go away, and i bootstrap after they failed, wont my
view of ring be skewed?
00:29 < stuhood> driftx: i guess in this case, the node would resume bootstrapping from
where it left off
00:29 < driftx> it would've missed writes in the meantime and require a repair afterwards
00:29 < stuhood> sorry… "resume" in the sense of "start over", but yea
00:31 < stuhood> that would be a pretty big change, but it might make sense
00:31 < goffinet> stuhood: what would you change
00:31 < stuhood> what you said, about nodes in joining staying in joining
00:31 < stuhood> so if the machine restarts, it begins joining at the same position
00:33 < goffinet> if we supported that + letting nodes gossip in hibernate, would allow
us to add capacity at operator control
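The "joining stays joining" change stuhood proposes amounts to persisting the joining state: record the chosen token durably before streaming begins, so a restarted node re-enters joining at the same ring position (restarting its streams from scratch, per the discussion) instead of vanishing like a fat client. The file format and names below are invented for illustration.

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of durable joining state: persist the token and a JOINING flag
// before streaming starts; on restart, resume joining at the same position.
// Illustrative only; not how Cassandra stores ring state.
public class JoinState {
    static void recordJoining(Path stateFile, String token) throws IOException {
        Files.writeString(stateFile, "JOINING " + token);
    }

    // On startup: if a joining record exists, re-enter joining with the
    // same token (streams restart from scratch, then repair covers the gap).
    static String resumeToken(Path stateFile) throws IOException {
        if (!Files.exists(stateFile))
            return null; // fresh node: choose a token normally
        String[] parts = Files.readString(stateFile).split(" ", 2);
        return "JOINING".equals(parts[0]) ? parts[1] : null;
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("join", ".state");
        recordJoining(f, "42");
        System.out.println(resumeToken(f)); // 42
    }
}
```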
> Support bringing up a new datacenter to existing cluster without repair
> -----------------------------------------------------------------------
>                 Key: CASSANDRA-3483
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.0.2
>            Reporter: Chris Goffinet
> Was talking to Brandon in irc, and we ran into a case where we want to bring up a new
DC to an existing cluster. He relayed jbellis's suggestion that the way to do it currently
is to set strategy options of dc2:0, then add the nodes. After the nodes are up, change the
RF of dc2 and run repair.
> I'd like to avoid a repair, as it runs AES and is a bit more intense than how bootstrap
currently works (just streaming ranges from the SSTables). Would it be possible to improve
this functionality (adding a new DC to an existing cluster) beyond the proposed method? We'd
be happy to do a patch if we got some input on the best way to go about it.

This message is automatically generated by JIRA.

