cassandra-user mailing list archives

From Jonathan Ellis <>
Subject Re: Flush / Snapshot Triggering Full GCs, Leaving Ring
Date Thu, 07 Apr 2011 14:28:25 GMT
No, 2252 is not suitable for backporting to 0.7.

On Thu, Apr 7, 2011 at 7:33 AM, ruslan usifov <> wrote:
> 2011/4/7 Jonathan Ellis <>
>> Hypothesis: it's probably the flush causing the CMS, not the snapshot
>> linking.
>> Confirmation possibility #1: Add a logger.warn to
>> CLibrary.createHardLinkWithExec -- with JNA enabled it shouldn't be
>> called, but let's rule it out.
>> Confirmation possibility #2: Force some flushes w/o snapshot.
>> Either way: "concurrent mode failure" is the easy GC problem.
>> Hopefully you really are seeing mostly that -- it means the JVM
>> didn't start CMS early enough, so it ran out of space before it could
>> finish the concurrent collection and fell back to a stop-the-world
>> collection.
>> The fix is a combination of reducing -XX:CMSInitiatingOccupancyFraction
>> and (possibly) increasing heap capacity if your heap is simply too
>> full too much of the time.
>> You can also mitigate it by increasing the phi threshold for the
>> failure detector, so the node doing the GC doesn't mark everyone else
>> as dead.
>> (Eventually your heap will fragment and you will see STW collections
>> due to "promotion failed," but you should see that much less
>> frequently. GC tuning to reduce fragmentation may be possible based on
>> your workload, but that's out of scope here and in any case the "real"
>> fix for that is
> Jonathan, do you have plans to backport this to the 0.7 branch? (It's very
> hard to tune CMS, and for people who are new to Java the task becomes much
> harder.)

Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
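A back-of-the-envelope model of the concurrent mode failure described above: if the old generation keeps filling while a CMS cycle runs, the cycle has to start with enough free headroom to absorb everything promoted before it finishes. The Java sketch below assumes a constant promotion rate and a fixed cycle duration (real workloads vary, and a flush can spike promotion), and every size and rate in `main` is hypothetical. The two HotSpot flags it prints, `-XX:CMSInitiatingOccupancyFraction` and `-XX:+UseCMSInitiatingOccupancyOnly`, are real; the class and method names are illustrative only.

```java
// Simplified model (assumption): the old generation fills at a constant
// promotion rate, and a CMS cycle needs a fixed amount of time to finish.
public class CmsHeadroom {

    /**
     * Returns true if a CMS cycle started at the given old-gen occupancy
     * fraction finishes before the old generation is full, i.e. no
     * "concurrent mode failure" under this simplified model.
     */
    static boolean cycleFinishesInTime(double oldGenMb, double initiatingFraction,
                                       double promotionMbPerSec, double cycleSeconds) {
        double headroomMb = oldGenMb * (1.0 - initiatingFraction);
        double promotedDuringCycleMb = promotionMbPerSec * cycleSeconds;
        return promotedDuringCycleMb < headroomMb;
    }

    /** Largest initiating occupancy (whole percent) that still leaves enough headroom. */
    static int maxSafeInitiatingPercent(double oldGenMb, double promotionMbPerSec,
                                        double cycleSeconds) {
        double neededFraction = promotionMbPerSec * cycleSeconds / oldGenMb;
        // Round down so we never suggest starting later than is safe.
        return (int) Math.floor((1.0 - neededFraction) * 100.0);
    }

    public static void main(String[] args) {
        // Hypothetical: 6 GB old gen, 100 MB/s promoted during a flush, 12 s CMS cycle.
        double oldGenMb = 6144, rate = 100, cycle = 12;
        System.out.println(cycleFinishesInTime(oldGenMb, 0.85, rate, cycle)); // prints false
        System.out.println(cycleFinishesInTime(oldGenMb, 0.75, rate, cycle)); // prints true
        int pct = maxSafeInitiatingPercent(oldGenMb, rate, cycle);
        System.out.println("-XX:CMSInitiatingOccupancyFraction=" + pct
                + " -XX:+UseCMSInitiatingOccupancyOnly");
    }
}
```

Under these assumptions, lowering the initiating fraction buys headroom; if the safe percentage comes out very low, the heap is simply too full too much of the time, which is where increasing heap capacity comes in.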
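Why raising the phi threshold helps: during a stop-the-world pause the collecting node processes no gossip, so when it resumes it sees a gap in heartbeats from every peer as long as the pause itself. The sketch below uses a simplified phi accrual model that assumes exponentially distributed heartbeat intervals, which gives phi(t) = t / (mean * ln 10); Cassandra's own failure detector may differ in detail. The one-second mean interval, the 20-second pause, and the thresholds of 8 and 12 are hypothetical inputs, and the class is illustrative.

```java
// Simplified phi accrual model (assumption): heartbeat inter-arrival times are
// exponentially distributed with the observed mean, so
//   phi(t) = -log10(P(silence >= t)) = -log10(exp(-t / mean)) = t / (mean * ln 10).
public class PhiThreshold {

    static double phi(double secondsSinceLastHeartbeat, double meanIntervalSeconds) {
        return secondsSinceLastHeartbeat / (meanIntervalSeconds * Math.log(10.0));
    }

    /** Longest silence (seconds) a peer can show before phi exceeds the threshold. */
    static double toleratedSilenceSeconds(double threshold, double meanIntervalSeconds) {
        return threshold * meanIntervalSeconds * Math.log(10.0);
    }

    /** True if a silence of this length would get the peer marked as dead. */
    static boolean convicted(double silenceSeconds, double meanIntervalSeconds, double threshold) {
        return phi(silenceSeconds, meanIntervalSeconds) > threshold;
    }

    public static void main(String[] args) {
        double mean = 1.0;   // hypothetical: one heartbeat per second on average
        double pause = 20.0; // hypothetical: a 20 s stop-the-world collection
        for (double threshold : new double[] {8.0, 12.0}) {
            System.out.printf("threshold %.0f: tolerates %.1f s, convicted after %.0f s pause: %b%n",
                    threshold, toleratedSilenceSeconds(threshold, mean), pause,
                    convicted(pause, mean, threshold));
        }
    }
}
```

Under this model the tolerated silence grows linearly with the threshold, so raising it trades slower detection of genuinely dead nodes for tolerance of longer GC pauses.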
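The "promotion failed" case can be illustrated with a toy model. The concurrent collector does not compact the old generation, so free space ends up split into chunks, and an object promoted from the young generation needs a single chunk big enough to hold it. The sketch assumes that simplified free-list picture (HotSpot's real CMS allocator is more involved); the chunk and object sizes are hypothetical, and the class is illustrative.

```java
import java.util.Collections;
import java.util.List;

// Toy model (assumption): the old generation is a list of free chunks that the
// concurrent collector never compacts; promoting an object needs one contiguous
// chunk at least as large as the object.
public class PromotionFailure {

    static long totalFree(List<Long> freeChunks) {
        long sum = 0;
        for (long chunk : freeChunks) {
            sum += chunk;
        }
        return sum;
    }

    /** True if some single free chunk can hold the promoted object. */
    static boolean canPromote(List<Long> freeChunks, long objectBytes) {
        for (long chunk : freeChunks) {
            if (chunk >= objectBytes) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical fragmented old gen: ten free 1 MB chunks, 10 MB free in total.
        List<Long> chunks = Collections.nCopies(10, 1L << 20);
        long objectBytes = 2L << 20; // hypothetical 2 MB object being promoted
        System.out.println("free bytes: " + totalFree(chunks));
        System.out.println("promotion succeeds: " + canPromote(chunks, objectBytes));
    }
}
```

Here 10 MB is free in total, yet a 2 MB promotion finds no chunk large enough; in the real JVM that situation forces a stop-the-world, compacting collection.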
