incubator-cassandra-user mailing list archives

From Aaron Morton <aa...@thelastpickle.com>
Subject Re: gossip marking all nodes as down when decommissioning one node.
Date Mon, 28 Oct 2013 07:16:28 GMT
>  (2 nodes in each availability zone)
How many AZs? 

> The ec2 instances are m1.large 
I strongly recommend using m1.xlarge with ephemeral disks, or a higher-spec machine; m1.large
is not up to the task.

> Why on earth is the decommissioning of one node causing all the nodes to be marked down?
Decommissioning a node causes it to stream its data to the remaining nodes, which then
perform compaction on what they receive. I would guess the low-powered m1.large nodes could
not handle the incoming traffic plus compaction, which probably resulted in GC problems
(check the logs) and caused them to be marked as down. 
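For example, long pauses are reported in system.log by GCInspector. Assuming the default package log location (adjust the path for your install):

```shell
# GCInspector logs GC pauses; the log path below is the package default
# and may differ in your deployment.
grep GCInspector /var/log/cassandra/system.log | tail -n 20
```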

> 1) If we set the phi_convict_threshold to 12 or higher the nodes never get marked down.
12 is a good number on aws. 
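That threshold lives in cassandra.yaml; the default is 8, and higher values make the accrual failure detector more tolerant of latency jitter such as EC2's:

```yaml
# cassandra.yaml fragment: raise the failure detector's conviction
# threshold from the default of 8 so transient EC2 latency spikes
# don't mark peers down.
phi_convict_threshold: 12
```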

> 2) or If we set the vnodes to 16 or lower we never see them get marked down.
I would leave this at 256. 
Fewer vnodes may mean slightly less overhead during repair, but the ultimate cause here is
the choice of hardware. 
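For reference, the vnode count is the num_tokens setting in cassandra.yaml; leaving the default means:

```yaml
# cassandra.yaml fragment: keep the default vnode count rather than
# dropping to 16.
num_tokens: 256
```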

> Is either of these solutions dangerous or better than the other?
Change the phi and move to m1.xlarge by doing a lift-and-shift: stop one node at a time and
copy all its data and config to a new node. 
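A sketch of that per-node lift-and-shift, assuming default package paths; the addresses, service name, and paths are placeholders to adjust for your deployment:

```shell
# Hypothetical lift-and-shift for one node. OLD/NEW addresses, the
# service name, and the data/config paths are assumptions.
OLD=10.0.1.10          # node being replaced
NEW=10.0.1.20          # replacement m1.xlarge

# 1. Flush memtables and stop accepting traffic on the old node.
nodetool -h "$OLD" drain

# 2. Stop the Cassandra process on the old node.
ssh "$OLD" sudo service cassandra stop

# 3. Copy data and configuration to the new node.
rsync -a "$OLD":/var/lib/cassandra/ "$NEW":/var/lib/cassandra/
rsync -a "$OLD":/etc/cassandra/     "$NEW":/etc/cassandra/

# 4. Fix listen_address/rpc_address in the copied config, then start.
ssh "$NEW" sudo service cassandra start
```

Because the new node keeps the old node's data directory (including its tokens), it takes over the old node's place in the ring without streaming.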

> The ultimate cause of the problem appears to be that the calculatePendingRanges in StorageService.java
> is an extremely expensive process

We don’t see issues like this other than on low-powered nodes. 
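As an aside, the quoted diagnosis is easy to reproduce in miniature: when a periodic heartbeat shares a single thread with an expensive job (as the gossip heartbeat shares the one GossipTasks thread with calculatePendingRanges), the heartbeat stalls and peers look dead. A minimal sketch in Python, not Cassandra code:

```python
# Sketch: a single-threaded scheduler, like the one-thread GossipTasks
# pool, lets one long-running job delay the periodic heartbeat task.
import sched
import time

s = sched.scheduler(time.monotonic, time.sleep)
beats = []

def heartbeat():
    # Stands in for the periodic gossip/failure-detector task.
    beats.append(time.monotonic())
    s.enter(0.1, 1, heartbeat)   # re-schedule at a fixed rate

def expensive_range_calculation():
    # Stands in for a long calculatePendingRanges() run; while it holds
    # the single thread, no heartbeat can fire.
    time.sleep(0.5)

s.enter(0.0, 1, heartbeat)
s.enter(0.05, 1, expensive_range_calculation)

start = time.monotonic()
while time.monotonic() - start < 1.0:
    s.run(blocking=False)
    time.sleep(0.01)

# Normal heartbeat spacing is ~0.1 s; while the expensive task runs,
# the gap blows out to ~0.5 s, the same effect the failure detector
# punishes with a "down" verdict.
gaps = [b - a for a, b in zip(beats, beats[1:])]
print(round(max(gaps), 2))
```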

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 26/10/2013, at 6:14 am, John Pyeatt <john.pyeatt@singlewire.com> wrote:

> We are running a 6-node cluster in amazon cloud (2 nodes in each availability zone).
> The ec2 instances are m1.large and we have 256 vnodes on each node.
> 
> We are using Ec2Snitch, NetworkTopologyStrategy and a replication factor of 3.
> 
> When we decommission one node suddenly reads and writes start to fail. We are seeing
> Not Enough Replicas error messages which doesn't make sense even though we are doing QUORUM
> reads/writes because there should still be 2 copies of each piece of data in the cluster.
> 
> Digging deep in the logs we see that the phi_convict_threshold is being exceeded so all
> nodes in the cluster are being marked down for a period of approximately 10 seconds.
> 
> Why on earth is the decommissioning of one node causing all the nodes to be marked down?
> 
> We have two ways to work around this, though we think we have found the ultimate cause
> of the problem.
> 1) If we set the phi_convict_threshold to 12 or higher the nodes never get marked down.
> 2) or If we set the vnodes to 16 or lower we never see them get marked down.
> 
> Is either of these solutions dangerous or better than the other?
> 
> 
> The ultimate cause of the problem appears to be that the calculatePendingRanges in StorageService.java
> is an extremely expensive process and is running in the same thread pool (GossipTasks) as
> the Gossiper.java code. calculatePendingRanges() runs during state changes of nodes (ex. decommissioning).
> During this time it appears that it is hogging the one thread in the GossipTasks thread pool
> thus causing things to get marked down from FailureDetector.java.
> 
> 
> 
> -- 
> John Pyeatt
> Singlewire Software, LLC
> www.singlewire.com
> ------------------
> 608.661.1184
> john.pyeatt@singlewire.com

