cassandra-commits mailing list archives

From "Matt Stump (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8150) Simplify and enlarge new heap calculation
Date Thu, 20 Nov 2014 19:43:34 GMT


Matt Stump commented on CASSANDRA-8150:

I wanted to add more evidence and exposition for the above recommendations so that my argument
can be better understood.

Young generation GC is pretty simple. The new heap is broken down into three segments: eden
and two survivor spaces, s0 and s1. All new objects are allocated in eden, and once eden reaches
a size threshold a minor GC is triggered. Only one survivor space is active at a time, and its
collection is triggered at the same time as eden's. All live objects from both eden and the
active survivor space are copied to the other survivor space, and then both eden and the
previously active survivor space are wiped clean. Objects bounce between the two survivor
spaces until MaxTenuringThreshold is hit (the default in C* is 1). Once an object survives
MaxTenuringThreshold collections it's copied to the tenured space, which is governed by a
different collector, in our case CMS, though it could just as easily be G1. This act of copying
is called promotion. Promotion from young generation to tenured space is what takes a long
time, so if you see long ParNew GC pauses it's because many objects are being promoted. You
decrease ParNew collection times by decreasing promotion.
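As a rough sketch, these are the JVM flags that control the spaces described above, in the style of cassandra-env.sh. The sizes shown are illustrative assumptions, not recommendations from this thread:

```shell
# Illustrative young-generation settings; the -Xmn value here is an example only.
JVM_OPTS="$JVM_OPTS -Xmn800M"                   # total new (young) heap: eden + s0 + s1
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"        # eden is 8x the size of one survivor space
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1" # C* default: promote after surviving 1 collection
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"           # parallel young-generation collector
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"    # CMS governs the tenured space
```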

What causes many objects to be promoted? Objects that have survived both the initial eden
collection and MaxTenuringThreshold collections in the survivor space. The main tunables are
the sizes of the various spaces in the young generation and MaxTenuringThreshold. Increasing
the young generation decreases the frequency at which we have to run GC, because more objects
can accumulate before we reach 75% capacity. Increasing both the young generation and
MaxTenuringThreshold gives short-lived objects more time to die, and dead objects don't get
promoted.
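For illustration only, tuning in the direction described above might look like the following. The exact values are hypothetical assumptions, not the settings used in the experiments reported below:

```shell
# Hypothetical tuned values for illustration; the exact numbers are assumptions.
JVM_OPTS="$JVM_OPTS -Xmn2G"                     # larger young gen: fewer, less frequent collections
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4" # more survivor-space round trips before promotion,
                                                # so short-lived objects get more chances to die young
```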

The vast majority of objects in C* are ephemeral, short-lived objects. The only things that
should live in tenured space are the key cache and, in releases < 2.1, memtables. If most
objects die in survivor space, you've solved the long GC pauses for both young gen and tenured
space.
As a data point, on the mixed cluster where we've been experimenting with these options most
aggressively, the longest CMS pause in a 24 hour period went from > 10s to less than 900ms,
and most nodes saw a max of less than 500ms. This is just the max CMS pause, which can include
an outlier like defragmentation; the average CMS pause is significantly lower, less than 100ms.
For ParNew collections we went from many, many pauses in excess of 200ms to a cluster-wide max
of 15ms and an average of 5ms. ParNew collection frequency decreased from one per second to
one every 10s in the worst case and one every 16 seconds on average.

This also unlocks additional throughput on large machines. For 20-core machines I was able
to increase throughput from 75k TPS to 110-120k TPS. On a 40-core machine we more than doubled
request throughput and significantly increased compaction throughput.

I've asked a number of other larger customers to help validate the new settings. I now view
GC pauses as a mostly solvable issue.

> Simplify and enlarge new heap calculation
> -----------------------------------------
>                 Key: CASSANDRA-8150
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Config
>            Reporter: Matt Stump
>            Assignee: Brandon Williams
> It's been found that the old twitter recommendation of 100m per core up to 800m is harmful
> and should no longer be used.
> Instead the formula used should be 1/3 or 1/4 of max heap, with a max of 2G. 1/3 vs 1/4 is
> debatable and I'm open to suggestions. If I were to hazard a guess, 1/3 is probably better
> for releases greater than 2.1.
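The proposed formula is easy to sketch in shell. Variable names here are illustrative, not the actual cassandra-env.sh logic:

```shell
# Sketch of the proposed calculation: new heap = max heap / 4 (or / 3), capped at 2G.
# Variable names are illustrative; this is not the actual cassandra-env.sh code.
max_heap_mb=8192                      # example: an 8G max heap
new_heap_mb=$((max_heap_mb / 4))      # 1/4 of max heap = 2048 MB
if [ "$new_heap_mb" -gt 2048 ]; then  # cap at 2G
  new_heap_mb=2048
fi
echo "HEAP_NEWSIZE=${new_heap_mb}M"   # prints HEAP_NEWSIZE=2048M
```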

This message was sent by Atlassian JIRA
