cassandra-user mailing list archives

From Dan van Kley <dvank...@salesforce.com>
Subject Re: Intermittent long application pauses on nodes
Date Mon, 27 Oct 2014 13:44:52 GMT
Excellent, thanks for the tips, Graham. I'll give SafepointTimeout a try
and see if that gives us anything to act on.

On Fri, Oct 24, 2014 at 3:52 PM, graham sanderson <graham@vast.com> wrote:

> And -XX:SafepointTimeoutDelay=xxx
>
> to set how long before it dumps output (defaults to 10000 ms, I believe)…
>
> Note it doesn’t actually time out by default; it just prints the
> problematic threads after that delay and keeps on waiting
>
> On Oct 24, 2014, at 2:44 PM, graham sanderson <graham@vast.com> wrote:
>
> Actually - there is
>
> -XX:+SafepointTimeout
>
> which will print out offending threads (assuming you reach a 10 second
> pause)…
>
> That is probably your best bet.
>
> On Oct 24, 2014, at 2:38 PM, graham sanderson <graham@vast.com> wrote:
>
> This certainly *sounds* like a JVM bug.
>
> We are running C* 2.0.9 on pretty high end machines with pretty large
> heaps, and don’t seem to have seen this (note we are on 7u67, so that might
> be an interesting data point, though since the old thread predated that
> probably not)
>
> 1) From the app/java side, I’d obviously see if you can identify anything
> which always coincides with this - repair, compaction etc
> 2) From the VM side (given that, as Benedict mentioned, some threads
> are taking a long time to rendezvous at the safe point, and it is probably
> not application threads), I’d look at what GC threads, compiler threads etc
> might be doing. As mentioned, it shouldn’t be anything to do with operations
> which run at a safe point anyway (e.g. scavenge)
> a) So look at what CMS is doing at the time and see if you can correlate
> b) Check Oracle for related bugs - didn’t obviously see any, but there
> have been some complaints related to compilation and safe points
> c) Add any compilation tracing you can
> d) Kind of important here - see if you can figure out via DTrace,
> SystemTap, gdb or whatever, what the threads are doing when this happens. Sadly
> it doesn’t look like you can figure out when this is happening (until
> afterwards) unless you have access to a debug JVM build (and can turn on
> -XX:+TraceSafepoint and look for a safe point start without a corresponding
> update within a time period) - if you don’t have access to that, I guess
> you could try and get a dump every 2-3 seconds (you should catch a 9 second
> pause eventually!)
>
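
The "get a dump every 2-3 seconds" idea in (d) could be sketched roughly as below. This is only an illustration, not something anyone in this thread has run: the function name capture_dumps, its parameters and the output file naming are all made up here, and the dump command itself is left to the caller.

```python
import subprocess
import time
from datetime import datetime, timezone
from pathlib import Path


def capture_dumps(command, out_dir, interval_s=2.5, count=3, timeout_s=30.0):
    """Run `command` `count` times, pausing `interval_s` between runs.

    Each run's combined stdout/stderr is written to its own timestamped
    file in `out_dir`. Returns the list of files written.
    (Hypothetical helper: name, parameters and file layout are assumptions.)
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(count):
        # UTC timestamp so files can be lined up against the GC log later.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S.%fZ")
        path = out / f"dump-{i:05d}-{stamp}.txt"
        try:
            result = subprocess.run(
                command,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                timeout=timeout_s,
            )
            data = result.stdout
        except subprocess.TimeoutExpired as exc:
            # Keep whatever was captured before the dump command hung.
            data = (exc.stdout or b"") + b"\n[dump command timed out]\n"
        path.write_bytes(data)
        written.append(path)
        if i + 1 < count:
            time.sleep(interval_s)
    return written
```

With gdb as the dump tool, a call might look like capture_dumps(["gdb", "-p", "12345", "-batch", "-ex", "thread apply all bt"], "/tmp/dumps", count=1000), where 12345 stands in for the Cassandra process id and /tmp/dumps is an arbitrary directory. Once a long pause shows up in the logs, the files whose timestamps fall inside it are the ones to read.
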
> On Oct 24, 2014, at 12:35 PM, Dan van Kley <dvankley@salesforce.com>
> wrote:
>
> I'm also curious to know if this was ever resolved or if there are any other
> recommended steps to take to continue tracking it down. I'm seeing the same
> issue in our production cluster, which is running Cassandra 2.0.10 and JVM
> 1.7u71, using the CMS collector. Just as described above, the issue is long
> "Total time for which application threads were stopped" pauses that are not
> a direct result of GC pauses (ParNew, initial mark or remark). When I
> enabled the safepoint logging I saw the same result, long "sync" pause
> times with short spin and block times, usually with the "RevokeBias"
> description. We're seeing pause times sometimes in excess of 10 seconds, so
> it's a pretty debilitating issue. Our machines are not swapping (or even
> close to it) or having other load issues when these pauses occur. Any ideas
> would be much appreciated. Thanks!
>
>
>
>
>
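
For finding the long "Total time for which application threads were stopped" entries described in the original report, a minimal sketch follows. It assumes the GC log contains lines of the form "... Total time for which application threads were stopped: 10.5000000 seconds" (as printed with -XX:+PrintGCApplicationStoppedTime); the function name long_pauses and the threshold parameter are made up for illustration.

```python
import re

# Matches the stopped-time line; any timestamp prefix or trailing text is ignored.
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"(?P<seconds>\d+(?:[.,]\d+)?) seconds"
)


def long_pauses(lines, threshold_s=1.0):
    """Yield (line_number, seconds, line) for stopped-time entries >= threshold_s.

    `lines` is any iterable of log lines, e.g. an open file.
    (Hypothetical helper: the name and threshold default are assumptions.)
    """
    for lineno, line in enumerate(lines, start=1):
        match = STOPPED_RE.search(line)
        if match is None:
            continue
        # Tolerate a decimal comma in case the log was written under another locale.
        seconds = float(match.group("seconds").replace(",", "."))
        if seconds >= threshold_s:
            yield lineno, seconds, line.rstrip("\n")
```

Opening the log (say a file named gc.log, a placeholder) and iterating over long_pauses(f, threshold_s=5.0) gives the line numbers to inspect. Checking whether a ParNew, initial mark or remark entry sits right before each hit separates ordinary GC pauses from the safepoint stalls discussed here.
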
