cassandra-user mailing list archives

From Zsolt Pálmai <>
Subject OOM after a while during compacting
Date Thu, 05 Apr 2018 11:47:31 GMT

I have a setup with 4 AWS nodes (m4.xlarge: 4 CPUs, 16 GB RAM, 1 TB SSD each),
and when I run the nodetool compact command on any of the servers I get an
out-of-memory exception after a while.

Before calling compact I ran a repair, and before that there was a bigger
update on a lot of entries, so I guess a lot of sstables were created. The
repair created around 250 pending compaction tasks. I managed to finish 2 of
the nodes by upgrading to a 2xlarge machine with twice the heap (but running
compact manually on them also killed one :/ so this isn't an ideal solution).

Some more info:
- Version is the newest, 3.11.2, with Java 8u116
- Using LeveledCompactionStrategy (we have mostly reads)
- Heap size is set to 8GB
- Using G1GC
- I tried moving the memtables off-heap. It helped, but I still got an
OOM last night
- Concurrent compactors is set to 1 but it still happens. I also tried
setting the compaction throughput between 16 and 128, with no change.
- Storage load is 127 GB/140 GB/151 GB/155 GB
- 1 keyspace, 16 tables, but there are a few SASI indexes on big tables.
- The biggest partition I found was 90 MB, but that table has only 2 sstables
attached and compacts in seconds. The rest are mostly single-row partitions
with a few tens of KB of data.
- Worst SSTable case: SSTables in each level: [1, 20/10, 106/100, 15, 0, 0,
0, 0, 0]
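
To make the level counts above easier to read, here is a minimal Python sketch that parses that line and lists the levels holding more SSTables than their limit. It assumes that a `count/limit` entry means the level is reported together with its limit, and that a bare number means no limit was reported. The function names `parse_levels` and `over_limit` are made up for this illustration and are not part of any Cassandra tool.

```python
def parse_levels(line):
    """Parse a bracketed level list into (level, count, limit) tuples.

    limit is None when the entry is a bare count with no "/limit" part.
    """
    inner = line[line.index("[") + 1 : line.index("]")]
    levels = []
    for level, field in enumerate(inner.split(",")):
        field = field.strip()
        if "/" in field:
            count, limit = (int(part) for part in field.split("/"))
        else:
            count, limit = int(field), None
        levels.append((level, count, limit))
    return levels


def over_limit(levels):
    """Return only the levels whose count exceeds the reported limit."""
    return [
        (level, count, limit)
        for level, count, limit in levels
        if limit is not None and count > limit
    ]


# The line quoted in this message.
line = "SSTables in each level: [1, 20/10, 106/100, 15, 0, 0, 0, 0, 0]"
levels = parse_levels(line)

for level, count, limit in over_limit(levels):
    print(f"L{level}: {count} SSTables, {count - limit} over the limit of {limit}")
```

For the line above this reports L1 (20 against a limit of 10) and L2 (106 against a limit of 100) as the levels with a backlog.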

The metrics look something like this before the node dies:

The top objects in the heap dump look like this:

The load is usually pretty low and the nodes are almost idle (avg 500
reads/sec, 30-40 writes/sec, with occasional spikes of a few seconds at >100
writes/sec), and the number of pending tasks is also usually around 0.

Any ideas? I'm starting to run out of them. Maybe the secondary indexes
cause problems? I could finish some bigger compactions where no index was
attached, but I'm not 100% sure this is the cause.

