cassandra-user mailing list archives

From: Mike Torra <>
Subject: Re: failing bootstraps with OOM
Date: Thu, 03 Nov 2016 13:32:07 GMT
Hi Alex - I do monitor sstable counts and pending compactions, but probably not closely enough.
In 3 of the 4 regions the cluster runs in, both counts are very high - ~30-40k sstables for
one particular CF, and >1k pending compactions on many nodes. I had noticed this before,
but I didn't have a good sense of what counts as "high" for these values.
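
For reference, both counts are easy to read with nodetool (the keyspace/table name below is just a placeholder):

    nodetool compactionstats          # "pending tasks" is the compaction backlog
    nodetool tablestats my_ks.my_cf   # look at "SSTable count" for each table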

It makes sense to me that this would cause the issues I've seen. After increasing concurrent_compactors
and compaction_throughput_mb_per_sec (to 8 and 64 MB/s, respectively), I'm starting to see those
counts go down steadily. Hopefully that will resolve the OOM issues, but it looks like it
will take a while for compactions to catch up.
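
Roughly, that change looks like this (the cassandra.yaml values need a restart, while the throughput limit can also be bumped at runtime with nodetool):

    # cassandra.yaml
    concurrent_compactors: 8
    compaction_throughput_mb_per_sec: 64

    # throughput only, no restart needed
    nodetool setcompactionthroughput 64
    nodetool getcompactionthroughput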

Thanks for the suggestions, Alex

From: Oleksandr Shulgin <>
Reply-To: <>
Date: Wednesday, November 2, 2016 at 1:07 PM
To: <>
Subject: Re: failing bootstraps with OOM

On Wed, Nov 2, 2016 at 3:35 PM, Mike Torra <> wrote:
> Hi All -
> I am trying to bootstrap a replacement node in a cluster, but it consistently fails to
bootstrap because of OOM exceptions. For almost a week I've been going through cycles of bootstrapping,
finding errors, then restarting / resuming bootstrap, and I am struggling to move forward.
Sometimes the bootstrapping node itself fails, which usually manifests first as very high
GC times (sometimes 30s+!), then nodetool commands start to fail with timeouts, then the node
will crash with an OOM exception. Other times, a node streaming data to this bootstrapping
node will have a similar failure. In either case, when it happens I need to restart the crashed
node, then resume the bootstrap.
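
For reference, resuming an interrupted bootstrap on the joining node is usually just:

    nodetool bootstrap resume
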
> On top of these issues, when I do need to restart a node it takes a loooong time (
This exacerbates the problem, because it takes so long to find out whether a change to the cluster
helps or whether it still fails. I am in the process of upgrading all nodes in the cluster from
m4.xlarge to c4.4xlarge, and I am running Cassandra DDC 3.5 on all nodes. The cluster has
26 nodes spread across 4 regions in EC2. Here is some other relevant cluster info (also in
the Stack Overflow post):
> Cluster Info
> Cassandra DDC 3.5
> EC2MultiRegionSnitch
> m4.xlarge, moving to c4.4xlarge
> Schema Info
> 3 CFs, all 'write once' (i.e. no updates), 1 week TTL, STCS (default)
> no secondary indexes
> I am unsure what to try next. The node that is currently having this bootstrap problem
is a pretty beefy box, with 16 cores, 30 GB of RAM, and a 3.2 TB EBS volume. The slow startup
time might be because of the high SSTable count that Jeff Jirsa mentioned
in a comment on the SO post, but I am at a loss for the OOM issues. I've tried:
> Changing from CMS to G1 GC, which seemed to help a bit
> Upgrading from 3.5 to 3.9, which did not seem to help
> Upgrading instance types from m4.xlarge to c4.4xlarge, which seems to help, but I'm still
having issues
> I'd appreciate any suggestions on what else I can try to track down the cause of these
OOM exceptions.


Do you monitor pending compactions and the actual number of SSTable files?

On startup Cassandra needs to touch most of the data files, and it also seems to keep some metadata
about every relevant file in memory.  We once got into a situation where we ended up with hundreds
of thousands of files per node, which resulted in OOMs on every other node of the ring and
a startup time of over half an hour (this was on version 2.1).
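
A quick way to get a feel for the live SSTable count on disk (assuming the default data directory; adjust the path for your install, and note that each SSTable is actually several component files):

    find /var/lib/cassandra/data -name '*-Data.db' | wc -l

    # rough per-table breakdown
    find /var/lib/cassandra/data -name '*-Data.db' \
        | awk -F/ '{print $(NF-2)"/"$(NF-1)}' | sort | uniq -c | sort -rn | head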

If you have many more files than you expect, then you should check and adjust your concurrent_compactors
and compaction_throughput_mb_per_sec settings.  Increase concurrent_compactors if you're behind
(the pending compactions metric is a hint) and consider un-throttling compaction until your situation
is back to normal.
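
Un-throttling can be done on the fly and reverted later; setting the throughput to 0 disables the limit entirely:

    nodetool setcompactionthroughput 0    # disable throttling while catching up
    nodetool getcompactionthroughput      # verify
    # once pending compactions are back to normal, restore your usual limit, e.g.:
    nodetool setcompactionthroughput 16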

