cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Easy way to overload a single node on purpose?
Date Thu, 16 Jun 2011 11:16:22 GMT
>     DEBUG 14:36:55,546 ... timed out

This is logged when the coordinator times out waiting for the replicas to respond. The timeout
setting is rpc_timeout in the yaml file. The client gets a TimedOutException as a result.


AFAIK there is no global "everything is good / bad" flag to check. For example, AFAIK a node
will not mark itself down if it runs out of disk space, so you need to monitor the free disk
space and alert on that.
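
A minimal sketch of such a free-space check, in Python since Java is out of the question for you. The function names, the 10% threshold and the path in the usage note are assumptions for illustration, not anything Cassandra provides:

```python
import shutil


def is_low_on_space(free_bytes, total_bytes, min_free_fraction=0.10):
    """Return True when the free fraction is below the alert threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (free_bytes / total_bytes) < min_free_fraction


def check_path(path, min_free_fraction=0.10):
    """Check the filesystem that holds `path`.

    Returns a (low, free_fraction) tuple so the caller can alert and log.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_fraction = usage.free / usage.total
    low = is_low_on_space(usage.free, usage.total, min_free_fraction)
    return low, free_fraction
```

You would call check_path() from cron or your existing health-check tool, once for the commitlog partition and once for each data partition (for example check_path("/var/lib/cassandra/commitlog"), if that is where yours lives), and tell the load balancer the node is unhealthy when it returns low.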

Having a ping column can work if every key is replicated to every node. It would tell you
the cluster is working, sort of. Once the number of nodes is greater than the RF, it only
tells you that a subset of the nodes (the replicas for that key) works.
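
To illustrate the point about RF, here is a rough sketch that assumes simple ring placement, where a key lives on its primary node plus the next RF - 1 nodes around the ring. The node names and helper functions are hypothetical, and other replication strategies place replicas differently:

```python
def replicas_for_key(nodes, primary_index, rf):
    """Nodes holding a key whose first replica is nodes[primary_index].

    Assumes replicas are the next `rf` nodes walking clockwise on the ring.
    """
    count = min(rf, len(nodes))
    return [nodes[(primary_index + i) % len(nodes)] for i in range(count)]


def unchecked_nodes(nodes, primary_index, rf):
    """Nodes that a single ping key never touches."""
    covered = set(replicas_for_key(nodes, primary_index, rf))
    return [n for n in nodes if n not in covered]


# 3 nodes, RF 3: the ping key is on every node.
print(unchecked_nodes(["node1", "node2", "node3"], 0, 3))

# 5 nodes, RF 3: two nodes are never exercised by the ping key.
print(unchecked_nodes(["node1", "node2", "node3", "node4", "node5"], 0, 3))
```

So on your 3-node cluster a single ping key only reaches every node if the RF is 3, and once you add nodes you would need one ping key per replica set.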

If you google around you'll find discussions about monitoring with Munin, Ganglia, Cloudkick
and OpsCenter.

If you install mx4j you can access the JMX metrics via HTTP.

Cheers
      
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 16 Jun 2011, at 10:38, Suan Aik Yeo wrote:

> Here's a weird one... what's the best way to get a Cassandra node into a "half-crashed"
> state?
> 
> We have a 3-node cluster running 0.7.5. A few days ago this happened organically to node1:
> the partition the commitlog was on was 100% full and there was a "No space left on device"
> error. After a while, although the cluster and node1 were still up, node1 appeared down to
> the other nodes, and messages like:
>     DEBUG 14:36:55,546 ... timed out
> started to show up in its debug logs.
> 
> We have a tool to indicate to the load balancer that a Cassandra node is down, but it
> didn't detect it that time. Now I'm having trouble purposefully getting the node back to that
> state, so that I can try other monitoring methods. I've tried to fill up the commitlog partition
> with other files, and although I get the "No space left on device" error, the node still doesn't
> go down and show the other symptoms it showed before.
> 
> Also, if anyone could recommend a good way for a node itself to detect that it's in such
> a state I'd be interested in that too. Currently what we're doing is making a "describe_cluster_name()"
> Thrift call, but that still worked when the node was "down". I'm thinking of something like
> reading/writing to a fixed value in a keyspace as a check... Unfortunately Java-based solutions
> are out of the question.
> 
> 
> Thanks,
> Suan

