From E S <>
Subject Odd Node Behavior
Date Mon, 14 May 2012 13:00:43 GMT

I am having some very strange issues with a Cassandra setup.  I recognize that this is not
the ideal cluster setup, but I'd still like to try and understand what is going wrong.

The cluster has 3 machines (A,B,C) running Cassandra 1.0.9 with JNA.  A & B are in datacenter1
while C is in datacenter2.  Cassandra knows about the different datacenter because of the
rack-inferring snitch.  However, we are currently using a simple placement strategy on the
keyspace.  All reads and writes are done with quorum.  Hinted handoffs are enabled.  Most
of the Cassandra settings are at their defaults, with the exception of thrift message sizes,
which we have upped to 256 MB (while very rare, we can sometimes have a few larger rows, so
we wanted a big buffer).  There is a firewall between the two datacenters.  We have enabled
TCP traffic for the thrift and storage ports (but not JMX, and no UDP).
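
For anyone wanting to rule out the firewall, a minimal sketch of a TCP reachability check
follows.  It only opens and closes a fresh connection, so it says nothing about existing
long-lived connections.  The function name, and any host names and port numbers passed to
it, are hypothetical and not taken from our setup.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a new TCP connection to host:port can be opened.

    Hypothetical helper for illustration only; it does not speak the
    Cassandra storage or thrift protocol, it just completes a TCP handshake.
    """
    try:
        # create_connection resolves the host and performs the TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run from B against C's storage and thrift ports (for example can_connect("node-c", 7000),
where both the host name and the port are placeholders), it would show whether a new
connection gets through the firewall at that moment.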

Another odd thing is that there are actually 2 Cassandra clusters hosted on these machines
(although with the same setup).  Each machine has 2 Cassandra processes, but everything is
running on different ports and different cluster names.

On one of the two clusters we were doing some failover testing.  We would take nodes down
quickly in succession and make sure the system remained up.

Most of the time, we got a few timeouts on the failover (unexpected, but not the end of the
world) and then quickly recovered; however, twice we were able to put the cluster in an unusable
state.  We found that sometimes node C, while seemingly up (no load, and marked as UP in
the ring by other nodes), was unresponsive to B (when A was down) when B was coordinating
a quorum write.  We see B making a request in the logs (on debug) and 10 seconds later timing
out.  We see nothing happening in C's log (also debug).  The box is just idling.  In retrospect,
I should have put it in trace (will do this next time).  We had it come back after 30 minutes
once.  Another time, it came back earlier after cycling it.
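
To spell out why an unresponsive C blocks B's writes while A is down, here is a minimal
sketch of the quorum arithmetic.  Cassandra computes quorum as floor(RF / 2) + 1.  The
replication factor of 3 used in the comments is an assumption (it is not stated above), and
the function names are made up for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Number of replica acknowledgements a QUORUM request needs."""
    return replication_factor // 2 + 1


def can_reach_quorum(replication_factor: int, responsive_replicas: int) -> bool:
    """True if enough replicas answer to satisfy QUORUM."""
    return responsive_replicas >= quorum(replication_factor)


# Assumed RF = 3 on the 3-node cluster (A, B, C): quorum(3) is 2.
# With A down, B and C must both acknowledge the write:
#   can_reach_quorum(3, 2) is True   (B and C respond)
#   can_reach_quorum(3, 1) is False  (only B responds, C is silent)
```

Under that assumption, a silent C leaves B with a single acknowledgement, so the write can
only time out.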

I also noticed a few other crazy log messages on C in that time period.  There were two instances
of "invalid protocol header", which in code seems to only happen when PROTOCOL_MAGIC doesn't
match.  That seems like an impossible state.

I'm currently at a loss trying to explain what is going on.  Has anyone seen anything like
this?  I'd appreciate any additional debugging ideas!  Thanks for any help.


