cassandra-user mailing list archives

From: Jeremiah D Jordan <>
Subject: Re: Cluster Maintenance Mishap
Date: Fri, 21 Oct 2016 04:18:39 GMT
The easiest way to figure out what happened is to examine the system log; it will show
exactly what each node did during those restarts.  But I’m pretty sure your nodes got new
tokens during that time.
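For example, assuming the default log location (adjust the path for your install), grepping
the system log for token-related messages around the restart window should show whether the
nodes generated new tokens:

    # search the Cassandra system log for token activity (path is an assumption)
    grep -i token /var/log/cassandra/system.log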

If you want to get back the data inserted during those 2 hours, and you still have the new
volumes, you could use sstableloader to send all the data from the
/var/data/cassandra_new/cassandra/* folders back into the cluster.
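A rough sketch of how that might look, run once per table directory (the host address and
keyspace/table names here are placeholders; sstableloader expects a path ending in
<keyspace>/<table>):

    # stream the sstables back into the live cluster via one of its nodes
    sstableloader -d 10.0.0.10 /var/data/cassandra_new/cassandra/data/my_keyspace/my_table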


> On Oct 20, 2016, at 3:58 PM, Branton Davis <> wrote:
> Howdy folks.  I asked about this in IRC yesterday, but we're hoping to confirm a couple
> of things for our sanity.
> Yesterday, I was performing an operation on a 21-node cluster (vnodes, replication factor
> 3, NetworkTopologyStrategy, and the nodes balanced across 3 AZs on AWS EC2).  The plan
> was to swap each node's existing 1TB volume (where all cassandra data, including the
> commitlog, is stored) with a 2TB volume.  The plan for each node (one at a time) was
> basically:
> rsync while the node is live (repeated until only minor differences remained)
> stop cassandra on the node
> rsync again
> replace the old volume with the new
> start cassandra
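For reference, a minimal sketch of that per-node procedure (device names, mount points, and
rsync flags here are assumptions, not from the original message):

    # initial copy while the node is live (repeat until the delta is small)
    rsync -a /var/data/cassandra/ /var/data/cassandra_new/

    # stop cassandra, then do a final catch-up copy with the node down
    sudo service cassandra stop
    rsync -a --delete /var/data/cassandra/ /var/data/cassandra_new/

    # remount so the new 2TB volume serves the original path
    # (device name and mount points are placeholders)
    sudo umount /var/data/cassandra_new
    sudo umount /var/data/cassandra
    sudo mount /dev/xvdf /var/data/cassandra

    sudo service cassandra start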
> However, there was a bug in the rsync command.  Instead of copying the contents of
> /var/data/cassandra to /var/data/cassandra_new, it copied them into
> /var/data/cassandra_new/cassandra.  So, when cassandra was started after the volume swap,
> it found an empty data directory and showed some behavior similar to bootstrapping a new
> node (data started streaming in from other nodes), but also some behavior similar to a
> node replacement (nodetool status showed the same IP address, but a different host ID).
> This happened with 3 nodes (one from each AZ).  The nodes had received 1.4GB, 1.2GB, and
> 0.6GB of data (whereas the normal load for a node is around 500-600GB).
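That is the classic rsync trailing-slash distinction; the two commands below behave
differently (paths taken from the description above):

    # copies the *contents* of the source directory -- what was intended
    rsync -a /var/data/cassandra/ /var/data/cassandra_new/

    # copies the directory itself, creating /var/data/cassandra_new/cassandra -- the bug
    rsync -a /var/data/cassandra /var/data/cassandra_new/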
> The cluster was in this state for about 2 hours, at which point cassandra was stopped
> on those 3 nodes.  Later, I moved the data from the original volumes back into place (so
> it should be in the original state from before the operation) and started cassandra back
> up.
> Finally, the questions.  We've accepted the potential loss of new data within those two
> hours, but our primary concern now is what was happening with the bootstrapping nodes.
> Would they have taken on the token ranges of the original nodes, or acted like new nodes
> and gotten new token ranges?  If the latter, is it possible that any data moved from the
> healthy nodes to the "new" nodes, or would restarting them with the original data (and
> repairing) put the cluster's token ranges back into a normal state?
> Hopefully that was all clear.  Thanks in advance for any info!
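
If you want to double-check the ring after the restore, comparing host IDs and token
ownership across the restarts should tell you whether things are back to normal:

    # host IDs, load, and ownership per node
    nodetool status

    # token-by-token ring assignment
    nodetool ring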
