cassandra-user mailing list archives

From graham sanderson <gra...@vast.com>
Subject Re: Intermittent long application pauses on nodes
Date Fri, 24 Oct 2014 19:38:31 GMT
This certainly sounds like a JVM bug.

We are running C* 2.0.9 on fairly high-end machines with fairly large heaps, and don’t appear
to have seen this. Note that we are on 7u67, which might be an interesting data point, though
probably not, since the old thread predates that release.

1) From the app/Java side, I’d first see if you can identify anything that always coincides
with this: repair, compaction, etc.
2) From the VM side: given that, as Benedict mentioned, some threads are taking a long
time to rendezvous at the safe point, and they are probably not application threads, I’d look
at what the GC threads, compiler threads etc. might be doing. As mentioned, it shouldn’t be
anything to do with operations that run at a safe point anyway (e.g. scavenge).
	a) So look at what CMS is doing at the time and see if you can correlate
	b) Check Oracle for related bugs. I didn’t see any obvious ones, but there have been some
complaints related to compilation and safe points
	c) Add any compilation tracing you can
	d) Kind of important here: see if you can figure out via dtrace, SystemTap, gdb or whatever
what the threads are doing when this happens. Sadly, it doesn’t look like you can tell that
a pause is happening until afterwards, unless you have access to a debug JVM build (where you
can turn on -XX:+TraceSafepoint and look for a safe point start without a corresponding
update within a time period). If you don’t have access to that, I guess you could try to
get a dump every 2-3 seconds (you should catch a 9 second pause eventually!)
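
A rough sketch of the "dump every 2-3 seconds" idea, in Python. This is only an illustration under assumptions: the function names are made up, and it shells out to jstack -F on the guess that a forced dump is more likely to work mid-stall, since a plain jstack may itself have to wait for the safe point. Swap in gdb or whatever tool suits your environment via dump_cmd.

```python
import subprocess
import time
from datetime import datetime


def take_dump(pid, dump_cmd=("jstack", "-F")):
    """Run the dump command against pid and return its combined output.

    dump_cmd is an assumption; replace it with whatever tool you prefer.
    """
    result = subprocess.run(
        list(dump_cmd) + [str(pid)],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


def collect_dumps(pid, interval_s=2.0, count=10, dump=take_dump, sleep=time.sleep):
    """Take `count` dumps, `interval_s` seconds apart.

    Returns a list of (ISO timestamp, dump text) pairs so each dump can be
    lined up afterwards against the pause times reported in the GC log.
    """
    dumps = []
    for i in range(count):
        dumps.append((datetime.now().isoformat(), dump(pid)))
        if i < count - 1:
            sleep(interval_s)
    return dumps
```

For example, collect_dumps(cassandra_pid, interval_s=2.5, count=1000) would cover roughly 40 minutes; the dumps whose timestamps fall inside a logged long pause are the ones worth reading.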

> On Oct 24, 2014, at 12:35 PM, Dan van Kley <dvankley@salesforce.com> wrote:
> 
> I'm also curious to know if this was ever resolved or if there's any other recommended
> steps to take to continue to track it down. I'm seeing the same issue in our production cluster,
> which is running Cassandra 2.0.10 and JVM 1.7u71, using the CMS collector. Just as described
> above, the issue is long "Total time for which application threads were stopped" pauses that
> are not a direct result of GC pauses (ParNew, initial mark or remark). When I enabled the
> safepoint logging I saw the same result, long "sync" pause times with short spin and block
> times, usually with the "RevokeBias" description. We're seeing pause times sometimes in excess
> of 10 seconds, so it's a pretty debilitating issue. Our machines are not swapping (or even
> close to it) or having other load issues when these pauses occur. Any ideas would be very
> appreciated. Thanks!
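
To correlate the long pauses with repair, compaction or CMS activity, it may help to pull them out of the GC log first. A minimal Python sketch follows, assuming the log contains the "Total time for which application threads were stopped: N seconds" lines quoted above; the function name and the 1 second threshold are illustrative choices, not anything from the thread.

```python
import re

# Matches the stopped-time message quoted in the thread and captures the
# duration. Assumes a plain decimal number followed by " seconds".
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"(\d+(?:\.\d+)?) seconds"
)


def long_pauses(lines, threshold_s=1.0):
    """Yield (line_number, seconds, line) for each pause >= threshold_s."""
    for number, line in enumerate(lines, 1):
        match = STOPPED_RE.search(line)
        if match is None:
            continue
        seconds = float(match.group(1))
        if seconds >= threshold_s:
            yield number, seconds, line.rstrip("\n")
```

Passing an open log file, e.g. long_pauses(open("gc.log")), yields the offending lines with their line numbers, so the surrounding log entries and the Cassandra system log can be checked for whatever was running at the time.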

