cassandra-user mailing list archives

From Vladimir Yudovin <vla...@winguzone.com>
Subject Re: failing bootstraps with OOM
Date Wed, 02 Nov 2016 18:34:33 GMT
Hi,



You could probably try starting the new node with auto_bootstrap: false and then repairing
keyspaces, or even tables, one by one with nodetool repair.
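
As a rough illustration of the "one by one" idea, here is a minimal Python sketch that builds one nodetool repair command per table, assuming the node was started with auto_bootstrap: false in cassandra.yaml. The keyspace and table names (events, clicks, views, sessions) and the helper name repair_commands are hypothetical placeholders, not taken from the cluster described below. The sketch only prints the commands; running them is left to the operator.

```python
def repair_commands(schema, nodetool="nodetool"):
    """Build one `nodetool repair` command per table.

    `schema` maps a keyspace name to a list of its table names.
    A keyspace with an empty table list gets a single keyspace-wide repair.
    """
    commands = []
    for keyspace, tables in schema.items():
        if not tables:
            # No tables listed: repair the whole keyspace in one command.
            commands.append([nodetool, "repair", keyspace])
            continue
        for table in tables:
            # nodetool repair <keyspace> <table> limits the repair to one table.
            commands.append([nodetool, "repair", keyspace, table])
    return commands


if __name__ == "__main__":
    # Hypothetical keyspace/table names, for illustration only.
    schema = {"events": ["clicks", "views"], "sessions": []}
    for cmd in repair_commands(schema):
        print(" ".join(cmd))
```

Repairing table by table keeps each streaming session smaller, which may make it easier to see which table, if any, triggers the memory pressure.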



Best regards, Vladimir Yudovin, 

Winguzone - Hosted Cloud Cassandra
Launch your cluster in minutes.





---- On Wed, 02 Nov 2016 10:35:45 -0400 Mike Torra <mtorra@demandware.com> wrote ----




Hi All -



I am trying to bootstrap a replacement node in a cluster, but it consistently fails to bootstrap
because of OOM exceptions. For almost a week I've been going through cycles of bootstrapping,
finding errors, then restarting / resuming bootstrap, and I am struggling to move forward.
Sometimes the bootstrapping node itself fails, which usually manifests first as very high
GC times (sometimes 30s+!), then nodetool commands start to fail with timeouts, then the node
will crash with an OOM exception. Other times, a node streaming data to this bootstrapping
node will have a similar failure. In either case, when it happens I need to restart the crashed
node, then resume the bootstrap.



On top of these issues, when I do need to restart a node it takes a loooong time (http://stackoverflow.com/questions/40141739/why-does-cassandra-sometimes-take-a-hours-to-start).
This exacerbates the problem because it takes so long to find out if a change to the cluster
helps or if it still fails. I am in the process of upgrading all nodes in the cluster from
m4.xlarge to c4.4xlarge, and I am running Cassandra DDC 3.5 on all nodes. The cluster has
26 nodes spread across 4 regions in EC2. Here is some other relevant cluster info (also in
the Stack Overflow post):



Cluster Info

- Cassandra DDC 3.5
- EC2MultiRegionSnitch
- m4.xlarge, moving to c4.4xlarge

Schema Info

- 3 CFs, all 'write once' (i.e. no updates), 1 week TTL, STCS (default)
- no secondary indexes

I am unsure what to try next. The node that is currently having this bootstrap problem is
a pretty beefy box, with 16 cores, 30G of RAM, and a 3.2T EBS volume. The slow startup time
might be because of the issues with a high number of SSTables that Jeff Jirsa mentioned in
a comment on the SO post, but I am at a loss for the OOM issues. I've tried:


- Changing from CMS to G1 GC, which seems to have helped a bit
- Upgrading from 3.5 to 3.9, which did not seem to help
- Upgrading instance types from m4.xlarge to c4.4xlarge, which seems to help, but I'm still
  having issues

I'd appreciate any suggestions on what else I can try to track down the cause of these OOM
exceptions.



- Mike






