Hello,

We recently experienced (pretty severe) data loss after moving our four-node Cassandra cluster from one EC2 availability zone to another.  Our strategy for doing so was as follows:
Everything seemed to work as expected.  As we decommissioned each node, we checked the logs for messages indicating "yes, this node is done decommissioning" before turning the node off.

Pretty quickly after the old nodes left the cluster, we started getting calls from clients about missing data.

We immediately turned the old nodes back on, and when they rejoined the cluster, *most* of the data reported missing returned.  For the rest of the missing data, we had to spin up a new cluster from EBS snapshots and copy it over.

What did we do wrong?

In hindsight, we noticed a few things that may be clues...
Here's more info about our cluster...
  • Cassandra 1.2.10
  • Replication factor of 3
  • Vnodes with 256 tokens
  • All tables made via CQL
  • Data dirs on EBS (yes, we are aware of the performance implications)

Thanks for the help.