cassandra-user mailing list archives

From Gregory Szorc <>
Subject RE: Nodes frozen in GC
Date Thu, 10 Mar 2011 21:43:48 GMT
I do believe there is a fundamental issue with compactions allocating too much memory and incurring
too many garbage collections (at least with 0.6.12).

On nearly every Cassandra node I operate, garbage collections simply get out of control during
compactions of any reasonably sized CF (>1GB). I can reproduce it on CFs with many wide
rows (1000s of columns) consisting of small columns (10s-100s of bytes), on CFs with
thin rows (<20 columns) and large columns (10s of MBs), and on everything in between.

From the GC logs, I can infer that Cassandra is allocating upwards of 4GB/s. I once gave the
JVM 30GB of heap and saw it run through the entire heap in a few seconds while doing a compaction!
It would continuously blow through the heap, incur a stop-the-world collection, and repeat.
Meanwhile, the compacted bytes figure reported by the JMX interface never increased and the
tmp sstable didn't grow in size.
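
For anyone who wants to estimate an allocation rate from their own GC logs, here is a rough Python sketch. It is not the method used for the numbers above. It assumes ParNew records of the usual -XX:+PrintGCDetails -XX:+PrintGCTimeStamps shape, each on a single line; with -XX:+PrintTenuringDistribution the record may be split across several lines, which would need joining first. The sample lines and the names PARNEW, SAMPLE and allocation_rates are made up for illustration.

```python
import re

# Assumed record shape (one collection per line):
#   10.050: [GC 10.050: [ParNew: 235968K->26176K(235968K), 0.0400 secs] ...
PARNEW = re.compile(r"(\d+\.\d+): \[GC.*?\[ParNew: (\d+)K->(\d+)K\(\d+K\)")

# Invented sample data, not taken from a real log.
SAMPLE = [
    "10.000: [GC 10.000: [ParNew: 209792K->26176K(235968K), 0.0400 secs] "
    "4000000K->3820000K(9411648K), 0.0401 secs]",
    "10.050: [GC 10.050: [ParNew: 235968K->26176K(235968K), 0.0400 secs] "
    "4029792K->3849792K(9411648K), 0.0401 secs]",
]


def allocation_rates(lines):
    """Yield (timestamp_s, MB_per_s) for each pair of consecutive ParNew records.

    Bytes allocated between two young collections are approximated as
    (young occupancy before this GC) - (young occupancy after the previous GC).
    """
    prev_time = prev_after_kb = None
    for line in lines:
        m = PARNEW.search(line)
        if m is None:
            continue  # not a ParNew record (CMS phases, histograms, ...)
        time_s = float(m.group(1))
        before_kb, after_kb = int(m.group(2)), int(m.group(3))
        if prev_time is not None and time_s > prev_time:
            allocated_mb = (before_kb - prev_after_kb) / 1024
            yield time_s, allocated_mb / (time_s - prev_time)
        prev_time, prev_after_kb = time_s, after_kb


if __name__ == "__main__":
    for ts, rate in allocation_rates(SAMPLE):
        print(f"{ts:.3f}s: {rate:.0f} MB/s")
```

The estimate ignores objects allocated directly in the tenured generation, so it is a lower bound on the real rate.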

My current/relevant JVM args are as follows (running on Sun w/ JNA 3.2.7):

-Xms9G -Xmx9G -Xmn256M -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintClassHistogram
-XX:+PrintTenuringDistribution -Xloggc:/var/log/cassandra/gc.log -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=3 -XX:CMSInitiatingOccupancyFraction=40
-XX:+HeapDumpOnOutOfMemoryError -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSFullGCsBeforeCompaction=1
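
To put those flags in perspective, a small back-of-the-envelope calculation in Python (a sketch, with helper names invented for illustration). It relies on HotSpot's definition of SurvivorRatio as eden size divided by the size of one survivor space, and on CMSInitiatingOccupancyFraction being a percentage of the tenured generation; the allocation rates plugged in are the 1GB/s and 4GB/s figures mentioned in this message.

```python
MB = 1024 * 1024
GB = 1024 * MB


def young_layout(young_bytes, survivor_ratio):
    """Split the young generation into (eden, one survivor space).

    SurvivorRatio=N means eden is N times one survivor space, and there
    are two survivor spaces, so the young gen has N + 2 equal parts.
    """
    survivor = young_bytes / (survivor_ratio + 2)
    return survivor * survivor_ratio, survivor


def seconds_to_fill(eden_bytes, alloc_bytes_per_s):
    """Time between young collections if eden fills at a constant rate."""
    return eden_bytes / alloc_bytes_per_s


def cms_trigger_bytes(heap_bytes, young_bytes, occupancy_percent):
    """Tenured occupancy at which a CMS cycle is initiated."""
    return (heap_bytes - young_bytes) * occupancy_percent / 100


if __name__ == "__main__":
    # -Xmx9G -Xmn256M -XX:SurvivorRatio=8 -XX:CMSInitiatingOccupancyFraction=40
    eden, survivor = young_layout(256 * MB, 8)
    print(f"eden {eden / MB:.1f} MB, each survivor {survivor / MB:.1f} MB")
    for rate_gb in (1, 4):
        ms = seconds_to_fill(eden, rate_gb * GB) * 1000
        print(f"{rate_gb} GB/s fills eden in about {ms:.0f} ms")
    trigger = cms_trigger_bytes(9 * GB, 256 * MB, 40)
    print(f"CMS starts at about {trigger / MB:.0f} MB of tenured occupancy")
```

With these settings eden is roughly 205MB, so at 1GB/s a young collection is needed about every 200ms, and at 4GB/s about every 50ms.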

I've tweaked with nearly every setting imaginable (
is a great resource, BTW) and can't control the problem. No matter what I do, nothing can
solve the problem of Cassandra allocating objects faster than the GC can clean them. And,
when we're talking about >1GB/s of allocations, I don't think you can blame GC for not
keeping up.

Since there is no way to prevent these frequent stop-the-world collections, we get frequent
client timeouts and an occasional unavailable response if we're unfortunate enough to have a couple
of nodes compacting large CFs at the same time (which happens more than I'd like).

For the past two weeks, we had N=<replication factor> adjacent nodes in our cluster
that failed to perform their daily major compaction on a particular column family. All N would
spew GCInspector logs, and the GC logs revealed a heavy memory allocation rate. The only resolution
was to restart Cassandra to abort the compaction. I isolated one node from network connectivity
and restarted it in a cluster of 1 with no caching, memtables, or any operations. Under these
ideal compacting conditions, I still ran into issues. I experimented with extremely large
young generations (up to 10GB), very low CMSInitiatingOccupancyFraction, etc, but Cassandra
would always allocate faster than the JVM could collect, eventually leading to stop-the-world.

Recently, we rolled out a change to the application accessing the cluster which effectively
resaved every column in every row. When this was mostly done, our daily major compaction for
the trouble CF, which had refused to compact for two weeks, suddenly completed! Most interesting.
(Although, it still went through memory to no end.)

One of my observations is that memory allocations during compaction seem to be mostly short-lived
objects. The young generation is almost never promoting objects to the tenured generation
(we changed our MaxTenuringThreshold to 3, from Cassandra's default of 1, to discourage early
promotion; a default of 1 seems rather silly to me). However, when the young generation is
being collected (which happens VERY often during compactions b/c allocation rate is so high),
objects are allocated directly into the tenured generation. Even with relatively short ParNew
collections (often <0.05s, almost always <0.1s wall time), these tenured allocations
quickly accumulate, initiating CMS and eventually stop-the-world.
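
One way to check how much the tenured generation grows across each young collection is to compare the ParNew figures with the whole-heap figures on the same log record. The Python sketch below is only an illustration: the record shape is the assumed -XX:+PrintGCDetails single-line form, the sample record is invented, and the names RECORD and tenured_growth_kb are hypothetical. If -XX:+PrintTenuringDistribution splits the record across lines, the pieces would have to be joined before matching.

```python
import re

# Assumed record shape:
#   [ParNew: <young before>K-><young after>K(<cap>K), <t> secs]
#       <heap before>K-><heap after>K(<cap>K)
RECORD = re.compile(
    r"\[ParNew: (\d+)K->(\d+)K\(\d+K\), [\d.]+ secs\] "
    r"(\d+)K->(\d+)K\(\d+K\)"
)


def tenured_growth_kb(line):
    """Return tenured-generation growth (KB) across one ParNew collection.

    Tenured occupancy is taken as whole-heap occupancy minus young
    occupancy, before and after the pause. Returns None when the line
    is not a ParNew record of the assumed shape.
    """
    m = RECORD.search(line)
    if m is None:
        return None
    young_before, young_after, heap_before, heap_after = map(int, m.groups())
    tenured_before = heap_before - young_before
    tenured_after = heap_after - young_after
    return tenured_after - tenured_before


if __name__ == "__main__":
    # Invented sample record, not taken from a real log.
    sample = (
        "10.050: [GC 10.050: [ParNew: 235968K->26176K(235968K), 0.0400 secs] "
        "4000000K->3820000K(9411648K), 0.0401 secs]"
    )
    print(tenured_growth_kb(sample), "KB added to tenured")
```

Summing this value over a compaction, and dividing by elapsed time, gives a rough rate at which the tenured generation fills toward the CMS trigger.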

Anyway, not sure how much additional writing is going to help resolve this issue. I have gobs
of GC logs and supplementary metrics data to back up my claims if those will help. But, I
have a feeling that if you just create a CF of a few GB and incur a compaction with the JVM
under a profiler, it will be pretty easy to identify the culprit. I've started down this path
and will let you know if I find anything. But, I'm no Java expert and am quite busy with other
tasks, so don't expect anything useful from me anytime soon.

I hope this information helps. If you need anything else, just ask, and I'll see what I can do.

Gregory Szorc

> -----Original Message-----
> From: [] On Behalf Of Peter
> Schuller
> Sent: Thursday, March 10, 2011 10:36 AM
> To: ruslan usifov
> Cc:
> Subject: Re: Nodes frozen in GC
> I think it would be very useful to get to the bottom of this but without
> further details (like the asked for GC logs) I'm not sure what to do/suggest.
> It's clear that a single CF with a 64 MB memtable flush threshold and without
> key cache and row cache and some bulk insertion, should not be causing the
> problems you are seeing, in general. Especially not with a >5 GB heap
> size. I think it is highly likely that there is some
> little detail/mistake going on here rather than a fundamental issue.
> But regardless, it would be nice to discover what.
> --
> / Peter Schuller