incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: HTimedOutException and cluster not working
Date Tue, 18 Sep 2012 11:36:59 GMT
What version are you on ?

>  HTimedOutException is logged for all the nodes. 
TimedOutException happens when fewer than CL replica nodes respond to the coordinator in time.

You could get the error from all nodes in your cluster if the 3 nodes that store the key are
having problems. 
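For reference, setting the write CL on the Hector side looks roughly like this. It's a minimal sketch from memory rather than your code; the cluster name, host list and keyspace name are placeholders, and I'm assuming the standard Hector classes (HFactory, CassandraHostConfigurator, ConfigurableConsistencyLevel, HConsistencyLevel):

    import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
    import me.prettyprint.cassandra.service.CassandraHostConfigurator;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.HConsistencyLevel;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;

    // List all the nodes so Hector can fail over between coordinators.
    // (host names and ports here are placeholders)
    Cluster cluster = HFactory.getOrCreateCluster("MyCluster",
            new CassandraHostConfigurator("node-1:9160,node-2:9160,node-3:9160"));

    // CL ONE: the coordinator returns as soon as one replica acknowledges the write.
    // A TimedOutException means not even that single ack arrived within the rpc timeout.
    ConfigurableConsistencyLevel policy = new ConfigurableConsistencyLevel();
    policy.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);

    Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster, policy);

With RF 3 this asks for exactly one ack, so a timeout on every coordinator really does point at all three replicas for that key being slow or down, rather than at the coordinator you happened to connect to.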

> MutationStage 16 2177067 879092633 0 0
This looks like mutations are blocked or running very very slowly. 

> FlushWriter 0 0 5616 0 1321
The All Time Blocked number means there were 1,321 times a thread tried to flush a memtable
but the queue of flushers was full. Do you use secondary indexes? If so, take a look at the
comments for memtable_flush_queue_size in the yaml file (a rough excerpt is below). 
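The setting lives in conf/cassandra.yaml and looks roughly like this (the default value here is from memory, so check your own file and read the comment above it):

    # Number of full memtables allowed to queue up waiting for a flush writer thread.
    # The yaml comment advises making this at least as large as the number of
    # secondary indexes on any single column family.
    memtable_flush_queue_size: 4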

>   and cluster settings, it should be possible in this scenario, write success 
>   on one of the nodes even though node-3 is too busy or failing for any reason? 
Yes. 
If only one replica fails to respond, the write should still succeed. If you got a TimedOut
with CL ONE, it sounds like more than one of the replica nodes was having problems. 

> * when hector client failover to other nodes, basically all the nodes fail, why
>   is this so?
Sorry I don't understand this question. 

> * what factors that increase MutationStage active and pending values?
Check the log for ERRORs.
Check for failing or overloaded IO. 
See the comment above about memtable flush queue size. 
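For a quick look, something along these lines works (the log path depends on how Cassandra was installed, so treat it as a placeholder):

    # recent errors in the Cassandra system log
    grep -i error /var/log/cassandra/system.log | tail -20

    # disk utilisation and wait times, to spot failing or overloaded IO
    iostat -x 5

    # watch the mutation backlog over time
    watch -n 5 'nodetool -h localhost tpstats'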

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 18/09/2012, at 4:24 PM, Jason Wee <peichieh@gmail.com> wrote:

> Hello,
> 
> For context on our environment, we have a cluster of 9 nodes with a few keyspaces. The
> client writes with a consistency level of one to a keyspace in the cluster that has a
> replication factor of 3. The hector client is configured with all the nodes in the cluster
> specified, so that for any write request two nodes can fail and the write still succeeds
> on one node.
> 
> However, in certain situations we see HTimedOutException logged while writing to the
> cluster. The hector client then fails over to the next node in the cluster, but we noticed
> that the same HTimedOutException is logged for all the nodes. As a result the cluster is
> not working as a whole. We checked all the nodes in the cluster for load; only node-3
> seems to have a high pending MutationStage when nodetool tpstats is run. The other nodes
> are fine, with 0 active and 0 pending for all stages. 
> 
> /nodetool -h localhost tpstats
> Pool Name                Active  Pending    Completed  Blocked  All time blocked
> ReadStage                     0        0     11116983        0                 0
> RequestResponseStage          0        0   1252368951        0                 0
> MutationStage                16  2177067    879092633        0                 0
> ReadRepairStage               0        0      3648106        0                 0
> ReplicateOnWriteStage         0        0     33722610        0                 0
> GossipStage                   0        0     20504608        0                 0
> AntiEntropyStage              0        0         1197        0                 0
> MigrationStage                0        0           89        0                 0
> MemtablePostFlusher           0        0         5659        0                 0
> StreamStage                   0        0          296        0                 0
> FlushWriter                   0        0         5616        0              1321
> MiscStage                     0        0         5964        0                 0
> AntiEntropySessions           0        0           88        0                 0
> InternalResponseStage         0        0           27        0                 0
> HintedHandoff                 1        2         5976        0                 0
> 
> Message type      Dropped
> RANGE_SLICE             0
> READ_REPAIR             0
> BINARY                  0
> READ                  178
> MUTATION            17467
> REQUEST_RESPONSE        0
> 
> We proceeded to check whether any compaction was running on node-3 and found the 
> following:
> 
> ./nodetool -hlocalhost compactionstats
> pending tasks: 196
> compaction type  keyspace     column family  bytes compacted  bytes total  progress
> Cleanup          MyKeyspace   MyCF                6946398685  10230720119    67.90%
> 
> 
> Question:
> * with a replication factor of 3 on the keyspace, a client write consistency 
>   level of one, and the current hector client and cluster settings, should it be 
>   possible in this scenario for the write to succeed on one of the nodes even 
>   though node-3 is too busy or failing for any reason? 
>   
> * when the hector client fails over to the other nodes, basically all the nodes
>   fail; why is this so?
>   
> * what factors increase the MutationStage active and pending values?
> 
> Thank you for any comments and insight
> 
> Regards,
> Jason

