Hi, all -
We've been running Cassandra 1.1.12 in production since February, and have experienced a vexing problem with an arbitrary node "falling out of" or separating from the ring on occasion.
When a node "falls out" of the ring, running nodetool ring on the misbehaving node shows that it believes it is Up but that the rest of the ring is Down, with question marks listed for the load of every other node. Running nodetool ring on any of the other nodes, however, shows the misbehaving node as Down and everything else as Up.
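The symptom above amounts to a disagreement between each node's view of the ring. Here is a minimal Python sketch of how one might flag it, assuming the Up/Down status from each node's nodetool ring output has already been collected into a dict. The `views` structure, the addresses, and the function name `find_isolated_nodes` are all hypothetical and not part of Cassandra.

```python
def find_isolated_nodes(views):
    """Return nodes that see themselves Up and every peer Down,
    while every peer sees them Down.

    views maps an observer address to {address: "Up" or "Down"},
    i.e. the status column that observer reports for each node.
    """
    isolated = []
    for node, own_view in views.items():
        peers = [p for p in views if p != node]
        if not peers:
            continue  # a single-node ring cannot disagree with itself
        sees_self_up = own_view.get(node) == "Up"
        sees_peers_down = all(own_view.get(p) == "Down" for p in peers)
        peers_see_node_down = all(views[p].get(node) == "Down" for p in peers)
        if sees_self_up and sees_peers_down and peers_see_node_down:
            isolated.append(node)
    return sorted(isolated)


# Hypothetical three-node ring in which 10.0.0.3 has "fallen out".
example_views = {
    "10.0.0.1": {"10.0.0.1": "Up", "10.0.0.2": "Up", "10.0.0.3": "Down"},
    "10.0.0.2": {"10.0.0.1": "Up", "10.0.0.2": "Up", "10.0.0.3": "Down"},
    "10.0.0.3": {"10.0.0.1": "Down", "10.0.0.2": "Down", "10.0.0.3": "Up"},
}
print(find_isolated_nodes(example_views))  # ['10.0.0.3']
```

A check like this only makes sense when run against views gathered at roughly the same time from every node, since a node that is genuinely restarting would also show as Down to its peers.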
Shutting down and restarting the misbehaving node does not change this behavior. The only way we have found to get the node to rejoin the ring is to shut it down and then run nodetool removetoken <misbehaving node token> followed by nodetool removetoken force from elsewhere in the ring. Once the node's token has been removed from the ring, the node rejoins and behaves normally when it is restarted.
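For reference, the workaround above can be written out as an ordered checklist. This is only a dry-run sketch in Python that prints the steps and does not execute anything. The helper name `recovery_steps`, the host addresses, and the token value are hypothetical, and how the Cassandra process is stopped and started depends on the installation, so those steps are left as comments.

```python
def recovery_steps(bad_host, bad_token, healthy_host):
    """Build the ordered workaround steps as plain strings (nothing is executed)."""
    return [
        f"# on {bad_host}: stop Cassandra (method depends on the install)",
        f"nodetool -h {healthy_host} removetoken {bad_token}",
        f"nodetool -h {healthy_host} removetoken force",
        f"# on {bad_host}: start Cassandra again",
    ]


# Hypothetical hosts and token, for illustration only.
steps = recovery_steps(
    bad_host="10.0.0.3",
    bad_token="113427455640312821154458202477256070485",
    healthy_host="10.0.0.1",
)
for step in steps:
    print(step)
```

The two nodetool commands are issued against a healthy node rather than the misbehaving one, matching the "elsewhere in the ring" part of the procedure.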
This is not a frequent occurrence - months can pass between incidents. It most commonly occurs when a different node is brought down and then back up, but it can also happen spontaneously. It is not associated with a network connectivity event; we've seen no interruption in the nodes' ability to communicate over the network. As noted above, it's also not isolated to a single node; we've seen this behavior on multiple nodes.
This has occurred both with identical seeds specified in cassandra.yaml on each node and with each seed node removed from its own seed list (so that no seed will try to auto-bootstrap from itself). The seeds have always been up and available.
Has anyone else seen similar behavior? For obvious reasons, we hate seeing one of the nodes suddenly "fall out" and require intervention when we flap another node, or for no reason at all.