If I had to do a repair after upping the RF, then that is probably what caused the data loss. Wish I had been more careful.

I'm guessing the data is irrevocably lost, since I didn't make any snapshots.

Would it be possible to figure out if only a certain part of the ring was affected? That would be helpful in figuring out what data was lost.

I've done a full repair now, so I'm also guessing that inconsistent data is now completely gone as well, right?

On Sunday, June 9, 2013 at 10:37 AM, Edward Capriolo wrote:

Sounds like your cluster got shufflef*cked.
You said : "After we had gotten all the data moved over we decided to add 2 more nodes, as well as up the RF to 2."

After you raise the replication factor, you have to run repair on all nodes. If you did not, and then proceeded to shuffle, you will likely have data loss.
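A minimal sketch of what "repair on all nodes" could look like, assuming the three hostnames that appear in the shell prompts quoted further down (cass1, cass2, cass3) and that `nodetool` is on the PATH. The helper names `repair_commands` and `run_repairs` are made up for illustration, and by default the script only prints the commands instead of running them.

```python
import subprocess

# Hostnames taken from the shell prompts quoted in this thread (an assumption
# about how the nodes are addressed; adjust to your own cluster).
HOSTS = ["cass1", "cass2", "cass3"]


def repair_commands(hosts):
    """Build one `nodetool -h <host> repair` command per node."""
    return [["nodetool", "-h", host, "repair"] for host in hosts]


def run_repairs(hosts, dry_run=True):
    """Go through the nodes one at a time; with dry_run, only print the commands."""
    for cmd in repair_commands(hosts):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first node whose repair fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_repairs(HOSTS)  # dry run by default
```

Running the nodes sequentially rather than in parallel is a design choice here, not a requirement stated anywhere in this thread.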

If you did repair all nodes before the shuffle, then I do not know; the shuffle must have gone wrong. If you're reading at CL.ALL and still seeing inconsistencies, that is bad. Possibly try raising the read repair chance to 100%, continue reading, and see if the data becomes consistent (though I do not know why repair would not do it).

On Sat, Jun 8, 2013 at 8:56 PM, Nimi Wariboko Jr <nimiwaribokoj@gmail.com> wrote:

We are seeing an issue where data that was written to the cluster is no longer accessible after trying to expand the size of the cluster. I will try to provide as much information as possible; I am just starting out with Cassandra and I'm not entirely sure what data is relevant.

All Cassandra nodes are 1.2.5, and each node has the same config. 

We started out moving our entire data set to a single Cassandra node. This node was initially set up with Initial Token : 0, as well as other default settings. After we had gotten all the data moved over we decided to add 2 more nodes, as well as up the RF to 2. We also decided to start using vnodes, which meant setting num_tokens to 256 and removing the initial token param. We then decided to run cassandra-shuffle as well.

During cassandra-shuffle we started to notice some rows were disappearing then reappearing, and other rows haven't come back at all. I decided to stop the shuffle, repair each node, and then restart the cluster; however, not all of the data has come back. Note that this is with CONSISTENCY ALL.

Here is my `nodetool status`. What is weird here is the token distribution: 260-239-1. I'm not an expert, but I believe it should be 256-256-256, or at least add up to 768.

Datacenter: 129
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns   Host ID                               Rack
UN  371.56 GB  260     38.1%  cde6c3be-a066-47f2-abc2-b1d78bee0d7c  196
UN  212.64 GB  239     61.5%  2cb24510-2f89-46b2-96b9-873f8e8e50da  196
UN  256.05 GB  1       0.4%   ce8d4ea9-8106-44b3-a2dd-c0230eb53c94  196

And here is the opscenter ring view (http://imgur.com/VssmFlw)

What's also weird is that the token count from `nodetool -h [host] info` differs from status.

root@cass1:~# nodetool -h cass1 info | grep Token
Token            : (invoke with -T/--tokens to see all 239 tokens)
root@cass1:~# nodetool -h cass2 info | grep Token
Token            : (invoke with -T/--tokens to see all 269 tokens)
root@cass1:~# nodetool -h cass3 info | grep Token
Token            : (invoke with -T/--tokens to see all 260 tokens)

I believe it has something to do with the cluster not "seeing" all the tokens, but I am not sure where to go from here. I don't believe any data was lost: there was no power outage, and all the data should have been committed to disk before we added the two other nodes.
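For what it's worth, the two sets of token counts quoted above can be tallied with a quick Python sketch. The numbers are copied straight from the `nodetool status` and `nodetool info` output in this message; the variable names are made up for illustration, and since the address column is missing from the status output, no attempt is made to match its rows to hostnames.

```python
# Token counts copied from the outputs quoted in this message.
NUM_TOKENS = 256  # num_tokens setting per node
NODE_COUNT = 3

status_tokens = [260, 239, 1]  # "Tokens" column of `nodetool status`
info_tokens = {"cass1": 239, "cass2": 269, "cass3": 260}  # `nodetool info`

expected_total = NUM_TOKENS * NODE_COUNT
status_total = sum(status_tokens)
info_total = sum(info_tokens.values())

print("expected total:", expected_total)
print("status total:  ", status_total, "(short by", expected_total - status_total, ")")
print("info total:    ", info_total, "(short by", expected_total - info_total, ")")
```

The per-node `info` counts add up to 768, while the `status` view adds up to 500. The gap of 268 is the same as the difference between 269 and 1, which would be consistent with (though it does not prove) the node that produced the status output seeing only one of another node's tokens.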