cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Odd Node Behavior
Date Mon, 14 May 2012 21:10:31 GMT
> Most of the time, we got a few timeouts on the failover (unexpected, but not the end of
> the world) and then quickly recovered;
For read or write requests? I'm guessing with 3 nodes you are using RF 3. In Cassandra 1.x
the default read repair chance is only 10%, so 90% of the time only CL nodes are involved in a read
request. If one of the nodes involved dies during the request, the coordinator will time out
waiting.
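A rough way to picture why the coordinator hangs (a toy sketch in Python, not Cassandra's actual logic; the function names and the hard-coded RF/CL values are illustrative):

```python
import random

def replicas_contacted(rf=3, cl_quorum=2, read_repair_chance=0.1):
    """Toy model: ~10% of reads are read-repair reads that touch all
    replicas; the other ~90% touch only the CL (quorum) replicas."""
    if random.random() < read_repair_chance:
        return rf          # read repair: all 3 replicas involved
    return cl_quorum       # normal case: only 2 of 3 replicas involved

def coordinator_times_out(contacted, died_mid_request, cl_quorum=2):
    """If a contacted replica dies mid-request, the coordinator is left
    waiting for a response that never arrives."""
    responses = contacted - died_mid_request
    return responses < cl_quorum

# Normal quorum read: only 2 replicas contacted; one dies mid-request,
# leaving 1 response, which is below quorum, so the coordinator times out.
print(coordinator_times_out(contacted=2, died_mid_request=1))  # True
```

With read repair (all 3 contacted), losing one replica still leaves quorum, which is why only some requests during the failover would have timed out.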
 
> We see B making a request in the logs (on debug) and 10 seconds later timing out.  We
> see nothing happening in C's log (also debug).
What were the log messages from the nodes? In particular the ones from StorageProxy on
node B and RowMutationVerbHandler on node C.

> In retrospect, I should have put it in trace (will do this next time)
TRACE logs a lot of stuff. I'd hold off on that.  

> I also noticed a few other crazy log messages on C in that time period. 
What were the log messages?

> There were two instances of "invalid protocol header", which in code seems to only happen
> when PROTOCOL_MAGIC doesn't match (MessagingService.java), which seems like an impossible
> state.
Often means something other than Cassandra connected on the port. 
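The check amounts to something like this (a Python sketch for illustration; the constant and function names here are made up, not Cassandra's actual code — see MessagingService.java for the real check):

```python
import struct

PROTOCOL_MAGIC = 0xCA552DFA  # illustrative magic value, not necessarily the real one

def validate_header(first_four_bytes: bytes) -> None:
    """Reject any connection whose first 4 bytes are not the expected magic.
    Anything else hitting the port (a port scan, a health check, a client
    pointed at the wrong port) fails here with 'invalid protocol header'."""
    (magic,) = struct.unpack(">I", first_four_bytes)
    if magic != PROTOCOL_MAGIC:
        raise IOError("invalid protocol header")

validate_header(struct.pack(">I", PROTOCOL_MAGIC))  # a real peer passes
try:
    validate_header(b"GET ")  # e.g. an HTTP client hitting the storage port
except IOError as e:
    print(e)  # invalid protocol header
```

So the state isn't impossible: the message just means the first bytes read off the socket weren't Cassandra's handshake.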

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 15/05/2012, at 1:00 AM, E S wrote:

> Hello,
> 
> I am having some very strange issues with a Cassandra setup.  I recognize that this is
> not the ideal cluster setup, but I'd still like to try and understand what is going wrong.
> 
> The cluster has 3 machines (A,B,C) running Cassandra 1.0.9 with JNA.  A & B are in
> datacenter1 while C is in datacenter2.  Cassandra knows about the different datacenters because
> of the rack inferred snitch.  However, we are currently using a simple placement strategy
> on the keyspace.  All reads and writes are done with quorum.  Hinted handoffs are enabled.
> Most of the Cassandra settings are at their defaults, with the exception of thrift message
> sizes, which we have upped to 256 MB (while very rare, we can sometimes have a few larger
> rows so wanted a big buffer).  There is a firewall between the two datacenters.  We have enabled
> TCP traffic for the thrift and storage ports (but not JMX, and no UDP).
> 
> Another odd thing is that there are actually 2 Cassandra clusters hosted on these machines
> (although with the same setup).  Each machine has 2 Cassandra processes, but everything is
> running on different ports and different cluster names.
> 
> On one of the two clusters we were doing some failover testing.  We would take nodes
> down quickly in succession and make sure the system remained up.
> 
> Most of the time, we got a few timeouts on the failover (unexpected, but not the end
> of the world) and then quickly recovered; however, twice we were able to put the cluster in
> an unusable state.  We found that sometimes node C, while seemingly up (no load, and marked
> as UP in the ring by other nodes), was unresponsive to B (when A was down) when B was coordinating
> a quorum write.  We see B making a request in the logs (on debug) and 10 seconds later timing
> out.  We see nothing happening in C's log (also debug).  The box is just idling.  In retrospect,
> I should have put it in trace (will do this next time).  We had it come back after 30 minutes
> once.  Another time, it came back earlier after cycling it.
> 
> I also noticed a few other crazy log messages on C in that time period.  There were two
> instances of "invalid protocol header", which in code seems to only happen when PROTOCOL_MAGIC
> doesn't match (MessagingService.java), which seems like an impossible state.
> 
> I'm currently at a loss trying to explain what is going on.  Has anyone seen anything
> like this?  I'd appreciate any additional debugging ideas!  Thanks for any help.
> 
> Regards,
> Eddie  
> 

