incubator-cassandra-user mailing list archives

From Peter Schuller <peter.schul...@infidyne.com>
Subject Re: ParNew (promotion failed)
Date Mon, 28 Mar 2011 18:20:47 GMT
> But he's talking about "promotion failed" which is about heap
> fragmentation, not "concurrent mode failure" which would indicate CMS
> too late.  So increasing young generation size + tenuring threshold is
> probably the way to go (especially in a read-heavy workload;
> increasing tenuring will just mean copying data in memtables around
> between survivor spaces for a write-heavy load).

Thanks for the catch. You're right.

For interested parties:

This caused me to look into when 'promotion failed' and 'concurrent
mode failure' are actually reported. With some background here (from
2006, so potentially out of date):

  http://blogs.sun.com/jonthecollector/entry/when_the_sum_of_the

I looked at a semi-recent openjdk7 (so it may have changed since 1.6).
"concurrent mode failure" seems to be logged in two cases; one is
CMSCollector::do_mark_sweep_work(). The other is
CMSCollector::acquire_control_and_collect().

The former is called by the latter if it is determined that compaction
should happen, which seems to boil down to whether the incremental
collection is "believed" to fail (my source navigation fu failed me and
I'm for some reason unable to find the implementation of
collection_attempt_is_safe() that applies...). The other concurrent
mode failure is reported if acquire_control_and_collect() determines
that a concurrent collection is already in progress.

That seems consistent with the blog entry.

"promotion failed" seems to be reported when an actual
next_gen->par_promote() call fails for a specific object.

So, my reading is that while 'promotion failed' can indeed be an
indicator of promotion failure due to fragmentation alone (if a
promotion were to fail in spite of there being plenty of free space
left), it can also have a cause overlapping with concurrent mode
failure in case a young-gen collection was attempted under the belief
that there would be enough space - only to then fail.
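
To make the fragmentation-only case concrete, here is a toy model in
Python. It is only a sketch and not how HotSpot's free lists actually
work: the chunk sizes, the object size and the can_promote() helper
are all made up for illustration. It shows a promotion failing even
though the total free space is larger than the object.

```python
def can_promote(free_chunks, object_size):
    """Return True if some single free chunk can hold the object.

    Toy assumption: without compaction, an object must fit into one
    contiguous free chunk; chunks cannot be combined.
    """
    return any(chunk >= object_size for chunk in free_chunks)


# Hypothetical free-list chunk sizes and object size (arbitrary units).
free_chunks = [64, 32, 128, 16]
object_size = 200

total_free = sum(free_chunks)

# Plenty of free space in total, but no single chunk is big enough.
fragmentation_failure = (
    total_free >= object_size and not can_promote(free_chunks, object_size)
)

print("total free:", total_free)
print("fits in one chunk:", can_promote(free_chunks, object_size))
print("fails due to fragmentation alone:", fragmentation_failure)
```

The overlapping case would be the one where total_free itself is
smaller than what needs to be promoted.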

However, given the reported numbers (CMS:
1341669K->1142937K(2428928K)) it does seem clear that finding
contiguous free space is indeed the problem.
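
The arithmetic behind that reading can be checked with a few lines of
Python. This is a rough sketch that assumes the usual
before->after(capacity) layout of the logged sizes; parse_gc_sizes()
is a hypothetical helper written for this example, not part of any
tool.

```python
import re


def parse_gc_sizes(entry):
    """Parse 'beforeK->afterK(capacityK)' into three ints (in KB)."""
    match = re.search(r"(\d+)K->(\d+)K\((\d+)K\)", entry)
    if match is None:
        raise ValueError("unrecognized GC size entry: %r" % entry)
    return tuple(int(group) for group in match.groups())


before_kb, after_kb, capacity_kb = parse_gc_sizes(
    "CMS: 1341669K->1142937K(2428928K)"
)

# Free space in the CMS generation before and after the collection.
free_before_kb = capacity_kb - before_kb
free_after_kb = capacity_kb - after_kb
occupancy_after = after_kb / capacity_kb

print("free before: %dK" % free_before_kb)
print("free after:  %dK" % free_after_kb)
print("occupancy after: %.1f%%" % (100 * occupancy_after))
```

Over 1 GB of the generation was free even before the collection ran,
so running out of total space does not look like the explanation.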

Running with -XX:PrintFLSStatistics=1 may yield interesting free-list
statistics, but of course it is purely diagnostic and won't fix
anything by itself.

-- 
/ Peter Schuller
