We need to do a rolling migration of our production Cassandra cluster: we are moving from Cassandra on Solaris to Cassandra on CentOS.
(We went with Solaris initially since most of our other production hosts run Solaris, but we ran into lockup issues during performance tests and decided to switch to Linux.)
Here are the steps we are following to take a node out of service and bring it back. Can someone comment on whether we are missing anything? For example, is it recommended to specify tokens in cassandra.yaml, or should we handle the seed hosts differently than described below?
1. Run nodetool decommission and wait for the data to be streamed out.
2. Re-image the host to CentOS (everything is wiped off the disks), installing the same Cassandra version.
3. Bring Cassandra back up.
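A minimal sketch of the per-node sequence we run, assuming hypothetical hostnames cass-node1 (the node being migrated) and cass-node2 (a surviving node), with nodetool talking to JMX on the default port:

```shell
# 1. Decommission the node; this streams its ranges to the rest of the ring.
nodetool -h cass-node1 decommission

# Watch streaming progress until it finishes.
nodetool -h cass-node1 netstats

# Confirm the node has left the ring (checked from a surviving node).
nodetool -h cass-node2 ring

# 2. Re-image the host to CentOS with the same Cassandra version (1.1.5),
#    leaving initial_token blank so bootstrap assigns a token.

# 3. Start Cassandra, then verify the node rejoins and data streams back in.
nodetool -h cass-node1 ring
nodetool -h cass-node1 netstats
```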
- We are using Cassandra 1.1.5.
- We do not specify any tokens in cassandra.yaml; we rely on bootstrap to assign tokens automatically.
- We are testing with a 4-node cluster and a single seed host. The seed host is specified in the cassandra.yaml of each node and is never changed.
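For reference, this is roughly what the relevant portion of each node's cassandra.yaml looks like under that setup; the seed address is a placeholder:

```yaml
# No token specified; bootstrap picks one automatically.
initial_token:

seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      # The single seed host, identical on every node and never changed.
      - seeds: "10.0.0.1"
```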
While testing the Solaris-to-Linux path, things work smoothly: data streams out fine and streams back in when the node comes back up. However, when testing the Linux-to-Solaris path (in case we need to roll back), we are facing issues with nodes rejoining the ring. nodetool indicates that the node has joined the ring, but no data streams in, and the node does not know about the keyspaces/column families. We see some errors in the logs of the newly added nodes, pasted below.
[17/06/2013:14:10:17 PDT] MutationStage:1: ERROR RowMutationVerbHandler.java (line 61) Error in row mutation
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=1020
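The UnknownColumnFamilyException suggests the rejoined node never picked up the cluster schema. One check, assuming Cassandra 1.1's nodetool and a placeholder hostname, is to compare schema versions across the ring:

```shell
# All nodes should report the same schema version; a node stuck with an
# empty or stale schema shows up as a second version in this output.
nodetool -h cass-node1 describecluster
```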