incubator-cassandra-user mailing list archives

From ian douglas <...@armorgames.com>
Subject Re: Working backwards from production to staging/dev
Date Fri, 25 Mar 2011 18:11:59 GMT
On 03/25/2011 10:12 AM, Jonathan Ellis wrote:
> On Fri, Mar 25, 2011 at 11:59 AM, ian douglas<ian@armorgames.com>  wrote:
>> (we're running v0.60)
> I don't know if you could hear that from where you are, but our whole
> office just yelled, "WTF!" :)

Ah, that's what that noise was... And yeah, we know we're way behind.
Our initial delay in upgrading was waiting for 0.7 to come out; then we
learned we needed a whole new Thrift client for our PHP code base, and
then we got busy on other things. We're now at a point where we have
some time to take care of Cassandra and get it upgraded.

Our planned path now is:

(our nodes' tokens were generated with the Python calculation (0, 1/3,
and 2/3 times 2^127) and the nodes are called 1 through 3,
respectively; our RF is currently set to 2)
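
For reference, the token calculation we're using is just the usual
even-spacing formula over the 0..2^127 ring; here's a minimal Python
sketch of it (the function name is ours, purely for illustration):

# Evenly spaced initial tokens on the 0..2^127 ring.
def initial_tokens(node_count):
    return [i * (2 ** 127) // node_count for i in range(node_count)]

print(initial_tokens(3))  # current ring: 0, 1/3 * 2^127, 2/3 * 2^127
print(initial_tokens(2))  # planned ring: 0, 1/2 * 2^127 (steps 4 and 9)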

1. remove node 1 from our software
2. bring node 1 offline after a flush/repair/cleanup (see the nodetool
sketch after this list)
3. run a cleanup on node 2 and then on node 3 so they have a full copy 
of all data from the old node 1 and each other.
4. bring up a new Large 64-bit instance, install 0.6.12, assign it a
Token value of 0 (node 1) with RF:2 on a new gossip ring, copy all data
from the 32-bit nodes 2 and 3, then run a repair/cleanup to remove any
duplicated data
5. remove node 3 from our software
6. point our code to the new 64-bit node 1
7. bring node 3 offline after a flush/repair/cleanup so node 2 has the 
last fresh copy of everything
8. bring node 2 offline after a flush/repair/cleanup
9. bring up another Large instance, get a copy of all data from our old 
node 2, assign a Token value of (1/2 * 2^127), RF:2, on the new gossip 
ring, run a repair to remove duplicate data, and then a cleanup so it 
gets replicated data from the new node 1
10. add the new node 2 to our software
11. run a final cleanup on the new node 1 and then on node 2 to make 
sure all data is replicated evenly on both nodes
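
For what it's worth, here's a rough Python sketch of how we plan to
script the flush/repair/cleanup pass from steps 2, 7 and 8 above. The
hostname, keyspace name, and JMX port are placeholders for our setup,
and the exact nodetool flags/arguments may differ a bit between
releases, so treat it as a sketch rather than a recipe:

import subprocess

def retire_prep(host, keyspace, jmx_port="8080"):
    base = ["nodetool", "-host", host, "-port", jmx_port]
    subprocess.check_call(base + ["flush", keyspace])  # flush memtables to SSTables
    subprocess.check_call(base + ["repair"])           # sync replicas with neighbors
    subprocess.check_call(base + ["cleanup"])          # drop data the node no longer owns

retire_prep("node1.internal", "OurKeyspace")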

... at this point, we should have two 64-bit Large instances, with
RF:2, on a new gossip ring, replacing our three 32-bit systems, with
minimal downtime and no data loss (just a data delay between steps 6
and 10 above).

Questions:
1. Does it appear that we've missed any steps, or are we doing
something out of order?
2. Is the flush/repair/cleanup overkill when bringing the old nodes 
offline, or is that the correct sequence to follow?
3. Will the difference in compute units (lower on Large instances than
on Medium instances) make any noticeable difference, or will the 64-bit
machine handle things efficiently enough that a Large instance still
works harder than a Medium instance? (we never did figure out how their
compute units work)
4. Can we follow similar steps when we're ready to upgrade to 0.7.x and
have our new Thrift client for PHP all squared away?


Thanks again for the help!!!

