cassandra-commits mailing list archives

From "Chris Goffinet (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-3483) Support bringing up a new datacenter to existing cluster without repair
Date Fri, 11 Nov 2011 05:44:51 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148280#comment-13148280 ]

Chris Goffinet commented on CASSANDRA-3483:
-------------------------------------------

Some discussion from irc:

{noformat}
23:43 < goffinet> has datastax ever had a customer add a new datacenter to an existing
cluster? No docs or info on web suggest anyone has done this before
23:44 < driftx> yeah
23:44 < goffinet> how is it done? we are running a case where if i modify strategy options
before adding nodes, writes will fail since no endpoints for DC have been added
23:44 < goffinet> we were expecting this might work because we want to bootstrap the
new DC to the existing cluster
23:44 < goffinet> take on writes + stream data with RF factor
23:45 < driftx> general best practice is (jbellis can correct if I'm outdated) add the
dc at rf:0, add the nodes/update snitch, repair
23:45 < driftx> err, update rf, repair
23:46 < goffinet> yeah mind if i open up a jira? that seems extreme to make the cluster
do that .. ?
23:46 < goffinet> or is repair smart enough to just stream ranges instead of AES?
23:46 < driftx> 'instead of AES?' that's what repair is, but if just streams ranges
23:46 < driftx> s/if/it/
23:47 < goffinet> right but AES builds merkle tree, scans through all data ?
23:47 < goffinet> isn't bootstrap a different operation?
23:47 < goffinet> when streaming just sstables
23:47 < driftx> yeah, it is
23:47 < goffinet> yeah thats more heavy. dont understand why we couldnt use that instead
23:47 < goffinet> like bootstrap
23:48 < stuhood> now that i think about it, it doesn't really make sense that a CL.ONE
write fails if a DC isn't available
23:48 < stuhood> independent of the bootstrap case, that sounds like the real issue
23:49 < stuhood> goffinet: ^
23:50 < driftx> hmm, yeah that doesn't
23:50 < driftx> but the problem with bootstrapping a dc is the first node you bootstrap
gets everything
23:50 < goffinet> stuhood: yeah. it was complaining about not enough endpoints 
23:50 < goffinet> driftx: why is that? if you are doubling the cluster, and assign the
tokens manually ?
23:51 < driftx> still have to do them 2 mins apart, and they're probably going to be
part of the same replica set which I think is troublesome too
23:51 < goffinet> driftx: maybe we can make repair a bit more intelligent? if no data
exists on the node .. just stream the ranges instead of using AES
23:52 < driftx> problem is we're pushing AES to do the entire replica set (which is
nearly does now)
23:52 < stuhood> goffinet: it shouldn't be as heavyweight as you're thinking
23:53 < goffinet> stuhood: but we have a way currently that is less heavy
23:53 < goffinet> i dont understand why we couldnt use that method
23:53 < stuhood> not implemented =)
23:53 < goffinet> don't cut corners :)
23:53 < stuhood> human time vs cpu time =P
23:54 < driftx> you could almost do something like #3452 and then have a jmx call to
say 'ok, finish'
23:54 < CassBotJr> https://issues.apache.org/jira/browse/CASSANDRA-3452 : Create an
'infinite bootstrap' mode for sampling live traffic
23:54 < driftx> except the first one that tries is going to have every node pound it
with all the writes
23:54 < goffinet> driftx: ill make a jira ticket so we can discuss there, it doesn't
seem like it would be too much trouble to support this use case
23:54 < goffinet> we'd be happy to write the patch after some input
23:55 < driftx> trickier than it sounds I'll bet, but sgtm
23:57 < stuhood> alternatively, is now the right time to add back group bootstrap?
23:58 < stuhood> so you'd 1) add the dc to the strategy, 2) do a group bootstrap of
the entire dc
23:58 < stuhood> would also have to fix the CL.ONE problem though.
23:59 < goffinet> how did group bootstrap work again?
23:59 < driftx> #2434 is relevant
23:59 < CassBotJr> https://issues.apache.org/jira/browse/CASSANDRA-2434 : range movements
can violate consistency
--- Day changed Fri Nov 11 2011
00:00 < stuhood> goffinet: bootstrapping many nodes at once without the 2 minute wait
00:01 < goffinet> why was it removed?
00:01 < stuhood> used zookeeper
00:01 < goffinet> oh.
00:01 < stuhood> but come to think of it, removing the 2 minute wait would seem to be
relatively easy
00:02 < goffinet> stuhood, i thought the 2 minute wait was just waiting for ring state
to settle?
00:02 < goffinet> before it streamed from nodes
00:02 < stuhood> goffinet: yea: you could form a "group" bootstrap by inverting things
and waiting until you -hadn't- seen a new node in 2-10 minutes before you chose a token and
started bootstrapping
00:03 < stuhood> so, not terribly simple, but.
00:04 < stuhood> you'd basically have a bunch of nodes sitting around waiting until
no new nodes started, and then they have to deterministically choose tokens.
00:05 < goffinet> yes
00:05 < stuhood> well, alternatively, you wouldn't need a new way to deterministically
choose tokens
00:05 < stuhood> (easier)
00:05 < stuhood> no… scratch that. you would need a way
00:05 < stuhood> for this DC case, all of the nodes are entering an empty ring
00:06 < stuhood> so the group would need to choose something balanced
00:06 < goffinet> empty ring?
00:06 < stuhood> yea, essentially… there are no tokens in that dc
00:06 < goffinet> but we were going to provide the tokens manually?
00:06 < goffinet> were you thinking of making it automatic?
00:07 < stuhood> yea. fixing bootstrapping groups of nodes would make automatic safe
again
00:08 < stuhood> so… whatever state a node is in when it is sitting and waiting for
enough information to choose a token, it should just stay that way and watch what other nodes
enter that state
00:08 < goffinet> so i have a question about the 120 second window you have to wait..
00:09 < stuhood> mm
00:09 < driftx> hmm, what if they started up at rf:0 but stayed in some dead state (hibernate
might work) without doing anything until you changed the rf, then actually bootstrapped?
00:09 < goffinet> so imagine i startup all the nodes in DC2 at same time, does join_ring=false
not grab gossip info at all? I was thinking it would be good if we could just start gossip
on all nodes, but until operator says 'go' then i could bootstrap them all at same time
00:09 < goffinet> since i would only have to wait at most 120 seconds before kicking
them all off
00:10 < stuhood> driftx: yea, that could work too… but you'd still need to choose
tokens. (also, the rf=0 thing shouldn't be necessary, right? that's the CL.ONE bug)
00:11 < driftx> well, you really want to choose tokens anyway
00:11 < stuhood> goffinet: it does get gossip… i think that's basically equivalent
to the pre-join state
00:11 < driftx> I guess you don't need rf=0 if all the nodes are in hibernate
00:12 < goffinet> yeah i think you do need hibernate in this case, because if i set
tokens upfront, i want all nodes to know about ATL ones too
00:12 < goffinet> before i kick off bootstrap
00:12 < stuhood> driftx: i'm confused… what is the difference between rf=0 and not
being there?
00:12 < stuhood> is that a workaround for the CL.ONE bug?
00:13 < driftx> you know there's a dc with rf:0, can add one with impacting anything
00:13 < driftx> err, without
00:14 < stuhood> so what was the point of adding it? that's why i'm confused...
00:14 < goffinet> im fine with rf:0, its so you can add the nodes to the cluster before
calling repair
00:14 < goffinet> before you add nodes
00:15 < driftx> because the dc is in the schema
00:15 < driftx> so you need it there to have nodes be in it
00:15 < stuhood> ah
00:16 < goffinet> driftx: any reason why we couldnt just fix that? so dc2:3 wont throw
an error if nodes are down?
00:16 < goffinet> that way you would needed to do two steps
00:16 < goffinet> dc2:0, add nodes, dc2:3
00:16 < goffinet> wouldn't*
00:16 < driftx> I don't understand, you can already do that
00:17 < driftx> you just have to repair afterwards
00:17 < goffinet> it throws an error currently? if you set dc2:3 and no nodes exist
for dc2
00:17 < goffinet> we'll double check on that
00:18 < goffinet> for writes
00:18 < driftx> oh, it does
00:19 < driftx> but only for writes
00:19 < goffinet> yeah
00:19 < goffinet> so thats fine, thats fixable
00:19 < goffinet> im just curious about a) how can we bootstrap nodes without 120s delays
between N nodes b) stream from DC1 without AES
00:21 < stuhood> goffinet: if you figure out a, i don't think b is necessary?
00:22 < stuhood> assuming they are aware of the other joining nodes, and can all join
the same range
00:22 < stuhood> that would be the keystone for some kind of group bootstrap
00:23 < goffinet> let me test out join_ring, because im curious. if join_ring=false
still gossips but doesnt offically join.. it would be nice if node 2 in DC2 knew about that
node too somehow?
00:23 < driftx> that's why I proposed cheating, add them all as non-members, then ask
them to bootstrap
00:23 < goffinet> because then .. i could just run a command on each node at same time
00:23 < goffinet> since they all know about each other in a hibernate state
00:23 < goffinet> driftx: yes i like that
00:24 < driftx>     private void joinTokenRing(int delay) throws IOException, org.apache.cassandra.config.ConfigurationException
00:24 < driftx>     {
00:24 < driftx>         logger_.info("Starting up server gossip");
00:24 < driftx> they don't use gossip with join_ring off
00:24 < stuhood> but will that actually allow them to all join the same range?
00:24 < goffinet> okay cool, yeah we would need to make it join in that special state
then
00:25 < stuhood> i think there is an edgecase here… if multiple nodes are joining
the same range, and one of them fails, then should they all fail?
00:25 < driftx> no, it basically saves you server startup time that is not ring-related
:)
00:25 < goffinet> stuhood, they all know the tokens ahead of time?
00:25 < goffinet> they just need to know the current global state of things
00:25 < stuhood> goffinet: right, but if they are streaming the range that they will
be responsible for...
00:26 < stuhood> Joining nodes don't stick around if they fail
00:26 < goffinet> they shouldnt be allowed to do that until they joined ?
00:26 < stuhood> nah, you stream while you are joining… unless you are talking about
repair
00:26 < goffinet> stuhood: was that removed? i thought u had to still remove the node
00:26 < goffinet> using the new options in 1.0
00:26 < stuhood> don't know about 1.0
00:27 < driftx> no, a failed non-member is just a fat client and disappears
00:27 < goffinet> but i thought there was a timeout for fat client ?
00:27 < goffinet> is it 30s or something?
00:27 < driftx> yes
00:28 < goffinet> so nodes that arent fat clients, why might we remove them ? if we
didnt..
00:28 < goffinet> and let the operator do it
00:28 < goffinet> or have a larger timeout
00:28 < goffinet> might make this a non-issue
00:28 < driftx> what does a larger timeout/keeping them around buy you?
00:29 < goffinet> because if they go away, and i bootstrap after they failed, wont my
view of ring be skewed?
00:29 < stuhood> driftx: i guess in this case, the node would resume bootstrapping from
where it left off
00:29 < driftx> it would've missed writes in the meantime and require a repair afterwards
anyway
00:29 < stuhood> sorry… "resume" in the sense of "start over", but yea
00:31 < stuhood> that would be a pretty big change, but it might make sense
00:31 < goffinet> stuhood: what would you change
00:31 < stuhood> what you said, about nodes in joining staying in joining
00:31 < stuhood> so if the machine restarts, it begins joining at the same position
again
00:33 < goffinet> if we supported that + letting nodes gossip in hibernate, would allow
us to add capacity at operator control
{noformat}
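
The group-bootstrap idea discussed above (wait until no new node has shown up for a quiet period, then have every waiting node pick tokens deterministically so the new DC ends up balanced) could be sketched roughly as follows. This is an illustration only, not Cassandra code: the function names `group_is_closed` and `balanced_tokens`, the 120-second default, and the node ids are hypothetical, and the token space assumes RandomPartitioner (0 to 2**127).

```python
RING_SIZE = 2 ** 127  # assumption: RandomPartitioner token space


def group_is_closed(join_times, now, quiet_period=120):
    """Return True once no new node has been seen for quiet_period seconds.

    join_times: times (in seconds) at which each waiting node was first seen.
    This is the "inverted" wait: the group forms when arrivals stop.
    """
    if not join_times:
        return False
    return now - max(join_times) >= quiet_period


def balanced_tokens(node_ids, offset=0):
    """Assign evenly spaced tokens to a group of nodes, deterministically.

    Sorting the ids means every node that sees the same group computes the
    same assignment without extra coordination. offset shifts the whole DC
    so its tokens do not collide with tokens already taken in another DC.
    """
    if not node_ids:
        raise ValueError("need at least one node")
    ordered = sorted(node_ids)
    step = RING_SIZE // len(ordered)
    return {node: (i * step + offset) % RING_SIZE for i, node in enumerate(ordered)}
```

Because the assignment depends only on the sorted membership, two nodes that disagree about who is in the group would compute different tokens, which is why the quiet-period check has to settle first.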
                
> Support bringing up a new datacenter to existing cluster without repair
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-3483
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3483
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.0.2
>            Reporter: Chris Goffinet
>
> Was talking to Brandon in IRC, and we ran into a case where we want to bring up a new
> DC in an existing cluster. He suggested (per jbellis) that the current way to do it is to set
> strategy options of dc2:0, then add the nodes. After the nodes are up, change the RF of dc2
> and run repair.
> I'd like to avoid a repair, as it runs AES and is a bit more intense than how bootstrap
> currently works, which just streams ranges from the SSTables. Would it be possible to improve
> on the proposed method for adding a new DC to an existing cluster? We'd be
> happy to do a patch if we got some input on the best way to go about it.
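
To illustrate why repair feels heavier than bootstrap for an empty datacenter, here is a toy model of a digest comparison. It is a sketch under loud assumptions: it uses one flat hash per range as a stand-in for a Merkle tree, the names `range_digest` and `ranges_to_stream` are hypothetical, and it does not reflect how Cassandra's anti-entropy service is actually implemented.

```python
import hashlib


def range_digest(rows):
    """Hash every row in one token range (stand-in for a Merkle tree leaf)."""
    h = hashlib.sha256()
    for key in sorted(rows):
        # Separators keep ("ab", "c") and ("a", "bc") from hashing the same.
        h.update(key.encode())
        h.update(b"\0")
        h.update(rows[key].encode())
        h.update(b"\0")
    return h.hexdigest()


def ranges_to_stream(source, target):
    """Return the ranges whose digests differ between source and target.

    source/target: dict mapping range name -> dict of key -> value.
    Computing the digests means reading every row on the source side.
    """
    return sorted(
        r for r in source
        if range_digest(source[r]) != range_digest(target.get(r, {}))
    )
```

When the target holds no data, every non-empty range differs, so the comparison ends up selecting everything for streaming, which is the same result a bootstrap-style transfer would reach without scanning the data to build digests first.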


       
