Don't run with a replication factor of 2. With RF=2, quorum is 2, so both replicas have to be up for a quorum operation to succeed. Use 3 instead, and do all reads and writes at QUORUM consistency.

That way, if a single node is down, all your operations will still complete, because quorum at RF=3 only needs 2 of the 3 replicas. In fact, even if every third node is down, you'll still be fine and able to handle all requests.

However, if two adjacent nodes are down at the same time, operations against keys that are stored on both of those servers will fail because quorum can't be satisfied.
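
Here is a rough sketch of the reasoning above, in case it helps. It is a toy model, not Cassandra code: it assumes equally sized token ranges and SimpleStrategy-style placement, where the range owned by node i is also replicated on the next RF-1 nodes around the ring. The names quorum and failing_ranges are made up for this example.

```python
def quorum(rf):
    """Number of replicas that must respond for a QUORUM read or write."""
    return rf // 2 + 1


def failing_ranges(num_nodes, rf, down):
    """Count primary ranges whose replica set cannot reach quorum.

    Assumption (toy model): the range owned by node i is replicated on
    nodes i, i+1, ..., i+rf-1, wrapping around the ring.
    """
    down = set(down)
    failed = 0
    for primary in range(num_nodes):
        replicas = [(primary + k) % num_nodes for k in range(rf)]
        alive = sum(1 for node in replicas if node not in down)
        if alive < quorum(rf):
            failed += 1
    return failed


if __name__ == "__main__":
    scenarios = [
        ("6 nodes, RF=2, one node down", 6, 2, [0]),
        ("6 nodes, RF=3, one node down", 6, 3, [0]),
        ("6 nodes, RF=3, every third node down", 6, 3, [0, 3]),
        ("6 nodes, RF=3, two adjacent nodes down", 6, 3, [0, 1]),
        ("100 nodes, RF=3, two adjacent nodes down", 100, 3, [10, 11]),
    ]
    for label, n, rf, down in scenarios:
        print(f"{label}: {failing_ranges(n, rf, down)} of {n} ranges fail")
```

Under these assumptions, a single dead node at RF=3 fails no ranges at all, and two adjacent dead nodes in a 100-node ring affect only the 2 ranges whose replica sets contain both of them. Real clusters with uneven token ranges or a different replication strategy will give different fractions.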

To gain a better understanding, repeat your tests, but with multiple random keys, and keep track of how many operations fail in each case.


/Henrik

On Thu, Dec 20, 2012 at 10:26 AM, Vasileios Vlachos <vasileiosvlachos@gmail.com> wrote:
Furthermore, with regard to being "unlucky" with the "wrong node", if
this is actually what is happening, how is it possible to ever have a
node-failure resilient Cassandra cluster? My understanding of this
implies that even with 100 nodes, every 1/100 writes would fail until
the node is replaced/repaired.