Hi Frank,

The "9391" under RevokeBias is the number of milliseconds spent synchronising on the safepoint prior to the VM operation, i.e. the time it took to ensure all application threads were stopped. So this is the culprit. Notice that the time spent spinning/blocking for the threads we were actually waiting on is very low; it looks to me as though this is time spent waiting for CMS threads to yield, though it is very hard to say with certainty. Either way, it doesn't look like the issue is the RevokeBias itself.
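For anyone scanning these logs by hand, here's a minimal sketch for picking out the bad safepoints (plain awk; the "safepoint.log" filename is hypothetical):

```shell
# Scan -XX:+PrintSafepointStatistics output and flag any safepoint whose
# sync time (third value in the [time: ...] group, in milliseconds)
# exceeded one second. "safepoint.log" is a hypothetical path.
awk -F'[][]' '/:/ && NF >= 5 {
    split($4, t, " ")  # t[1]=spin t[2]=block t[3]=sync t[4]=cleanup t[5]=vmop
    if (t[3] + 0 > 1000) print
}' safepoint.log
```

On the output you posted, this would surface only the RevokeBias line with sync=9391.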

I think we should take this off list. It definitely warrants a ticket, though I expect it will be difficult to pin down, so you will need to be willing to experiment a bit with us; we would be very grateful for the help. If you can isolate and share a specific workload that triggers this, we may be able to do it without you, though!

It's possible that this is a JVM issue, but if so there may be some remedial action we can take anyway. There are some more flags we should add, but we can discuss that once you open a ticket. If you could include the strange JMX error as well, that might be helpful.

Thanks,

Benedict


On 3 February 2014 15:34, Frank Ng <fntemk@gmail.com> wrote:
I was able to send safepoint statistics to a separate log file via the additional JVM flags, and recently noticed a pause of 9.3936600 seconds.  Here are the log entries:

GC Log file:
-------------------
2014-01-31T07:49:14.755-0500: 137460.842: Total time for which application threads were stopped: 0.1095540 seconds
2014-01-31T07:51:01.870-0500: 137567.957: Total time for which application threads were stopped: 9.3936600 seconds
2014-01-31T07:51:02.537-0500: 137568.623: Total time for which application threads were stopped: 0.1207440 seconds

JVM Stdout Log File:
-------------------------------
         vmop                     [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
137460.734: GenCollectForAllocation          [     421          0            0   ]    [     0     0     23      0     84 ]  0
         vmop                     [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
137558.562: RevokeBias                       [     462          2            9   ]    [    13     0   9391      1      0 ]  0
<writer thread='47436187662656'/>
<dependency_failed type='leaf_type' ctxk='javax/management/ObjectName$Property' witness='javax/management/ObjectName$PatternProperty' stamp='137568.503'/>
<writer thread='47436033530176'/>
         vmop                     [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
137568.500: Deoptimize                       [     481          1            5   ]    [     0     0    118      0      1 ]  0
         vmop                     [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
137569.625: no vm operation                  [     483          0            1   ]    [     0     0     18      0      0 ]  0
         vmop                     [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
137571.641: no vm operation                  [     483          0            1   ]    [     0     0     42      1      0 ]  0
         vmop                     [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
137575.703: no vm operation                  [     483          0            1   ]    [     0     0     25      1      0 ]  0


If SafepointStatistics are printed before the Application Stop times, then it seems that the RevokeBias was the cause of the pause.
If SafepointStatistics are printed after the Application Stop times, then it seems that the Deoptimize was the cause of the pause.
In addition, I see a strange dependency_failed error relating to JMX in the JVM stdout log file.

thanks....


On Wed, Jan 29, 2014 at 4:44 PM, Benedict Elliott Smith <belliottsmith@datastax.com> wrote:
Add some more flags: -XX:+UnlockDiagnosticVMOptions -XX:LogFile=${path} -XX:+LogVMOutput
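In case it's useful, here's roughly how all of those flags could be wired in (a sketch only: the conf/cassandra-env.sh location and the CASSANDRA_LOG_DIR variable are assumptions and may differ in your install):

```shell
# Hypothetical excerpt for conf/cassandra-env.sh: emit per-safepoint
# statistics, and capture VM output (normally lost when stdout is closed)
# to a file via the diagnostic LogVMOutput/LogFile options.
JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions -XX:+LogVMOutput"
JVM_OPTS="$JVM_OPTS -XX:LogFile=${CASSANDRA_LOG_DIR:-/var/log/cassandra}/vm.log"
```

-XX:+UnlockDiagnosticVMOptions must come before the diagnostic flags it unlocks.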

I never figured out what kills stdout for C*. It's a library we depend on; I didn't try too hard to figure out which one.


On 29 January 2014 21:07, Frank Ng <fntemk@gmail.com> wrote:
Benedict,
Thanks for the advice.  I've tried turning on PrintSafepointStatistics.  However, that info is only sent to the STDOUT console.  The cassandra startup script closes STDOUT when it finishes, so nothing is shown for safepoint statistics once startup is done.  Do you know how to start up cassandra so that all stdout goes to a log file, and cassandra doesn't close stdout?

Also, we have swap turned off as recommended.

thanks


On Wed, Jan 29, 2014 at 3:39 PM, Benedict Elliott Smith <belliottsmith@datastax.com> wrote:
Frank,

The same advice for investigating holds: add the VM flags -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1   (you could put something above 1 there to reduce the amount of logging, since a pause of 52s will be pretty obvious even if aggregated with lots of other safepoints; the count is the number of safepoints to aggregate into one log message)

52s is a very extreme pause, and I would be surprised if revoke bias could cause this. I wonder if the VM is swapping out.


On 29 January 2014 19:02, Frank Ng <fntemk@gmail.com> wrote:
Thanks for the update.  Our logs indicated that there were 0 pending for CompactionManager at that time.  Also, there were no nodetool repairs running at that time.  The log statements above state that the application had to stop to reach a safepoint.  Yet, it doesn't say what is requesting the safepoint.


On Wed, Jan 29, 2014 at 1:20 PM, Shao-Chuan Wang <shaochuan.wang@bloomreach.com> wrote:
We had similar latency spikes when pending compactions couldn't keep up, or when repair/streaming took too many cycles.


On Wed, Jan 29, 2014 at 10:07 AM, Frank Ng <fntemk@gmail.com> wrote:
All,

We've been having intermittent long application pauses (version 1.2.8) and are not sure if it's a cassandra bug.  During these pauses, there are dropped messages in the cassandra log file, and the node sees other nodes as down.  We've turned on GC logging, and the following is an example of a long "stopped" or pause event in the gc.log file.

2014-01-28T23:11:12.183-0500: 1337654.424: Total time for which application threads were stopped: 0.091450 seconds
2014-01-28T23:14:11.161-0500: 1337833.401: Total time for which application threads were stopped: 51.8190260 seconds
2014-01-28T23:14:19.870-0500: 1337842.111: Total time for which application threads were stopped: 0.005470 seconds

As seen above, there was a 0.091450 secs pause, then a 51.8190260 secs pause.  There were no GC log events between those 2 log statements.  Since there are no GC logs in between, something else must be causing the long stop time to reach a safepoint.

Could there be a Cassandra thread that is taking a long time to reach a safepoint and what is it trying to do? Along with the node seeing other nodes as down in the cassandra log file, the StatusLogger shows 1599 Pending in ReadStage and 9 Pending in MutationStage.

There is mention of cassandra batch revoke bias locks as a possible cause (not GC) via:
http://www.mail-archive.com/user@cassandra.apache.org/msg34401.html

We have JNA, no swap, and the cluster runs fine apart from these intermittent long pauses, which can cause a node to appear down to other nodes.  Any ideas as to the cause of the long pause above? It seems unrelated to GC.

thanks.