From Mike Torra <>
Subject failing bootstraps with OOM
Date Wed, 02 Nov 2016 14:35:45 GMT
Hi All -

I am trying to bootstrap a replacement node in a cluster, but it consistently fails to bootstrap
because of OOM exceptions. For almost a week I've been going through cycles of bootstrapping,
finding errors, then restarting / resuming bootstrap, and I am struggling to move forward.
Sometimes the bootstrapping node itself fails, which usually manifests first as very high
GC times (sometimes 30s+!), then nodetool commands start to fail with timeouts, then the node
will crash with an OOM exception. Other times, a node streaming data to this bootstrapping
node will have a similar failure. In either case, when it happens I need to restart the crashed
node, then resume the bootstrap.

On top of these issues, when I do need to restart a node it takes a loooong time (
This exacerbates the problem because it takes so long to find out if a change to the cluster
helps or if it still fails. I am in the process of upgrading all nodes in the cluster from
m4.xlarge to c4.4xlarge, and I am running Cassandra DDC 3.5 on all nodes. The cluster has
26 nodes spread across 4 regions in EC2. Here is some other relevant cluster info (also in
the Stack Overflow post):

Cluster Info

  *   Cassandra DDC 3.5
  *   EC2MultiRegionSnitch
  *   m4.xlarge, moving to c4.4xlarge

Schema Info

  *   3 CFs, all 'write once' (i.e. no updates), 1 week TTL, STCS (default)
  *   no secondary indexes

I am unsure what to try next. The node that is currently having this bootstrap problem is
a pretty beefy box, with 16 cores, 30G of RAM, and a 3.2T EBS volume. The slow startup time
might be because of the issues with a high number of SSTables that Jeff Jirsa mentioned in
a comment on the SO post, but I am at a loss for the OOM issues. I've tried:

  *   Changing from CMS to G1 GC, which seems to have helped a bit
  *   Upgrading from 3.5 to 3.9, which did not seem to help
  *   Upgrading instance types from m4.xlarge to c4.4xlarge, which seems to help, but I'm
still having issues

I'd appreciate any suggestions on what else I can try to track down the cause of these OOM

- Mike

