incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Frank Ng <fnt...@gmail.com>
Subject Re: Intermittent long application pauses on nodes
Date Mon, 03 Feb 2014 15:34:58 GMT
I was able to send SafePointStatistics to another log file via the
additional JVM flags and recently noticed a pause of 9.3936600 seconds.
Here are the log entries:

GC Log file:
-------------------
2014-01-31T07:49:14.755-0500: 137460.842: Total time for which application
threads were stopped: 0.1095540 seconds
2014-01-31T07:51:01.870-0500: 137567.957: Total time for which application
threads were stopped: 9.3936600 seconds
2014-01-31T07:51:02.537-0500: 137568.623: Total time for which application
threads were stopped: 0.1207440 seconds

JVM Stdout Log File:
-------------------------------
         vmop                     [threads: total initially_running
wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
137460.734: GenCollectForAllocation          [     421
0            0   ]    [     0     0     23      0     84 ]  0
         vmop                     [threads: total initially_running
wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
137558.562: RevokeBias                       [     462
2            9   ]    [    13     0   9391      1      0 ]  0
<writer thread='47436187662656'/>
<dependency_failed type='leaf_type'
ctxk='javax/management/ObjectName$Property'
witness='javax/management/ObjectName$PatternProperty' stamp='137568.503'/>
<writer thread='47436033530176'/>
         vmop                     [threads: total initially_running
wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
137568.500: Deoptimize                       [     481
1            5   ]    [     0     0    118      0      1 ]  0
         vmop                     [threads: total initially_running
wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
137569.625: no vm operation                  [     483
0            1   ]    [     0     0     18      0      0 ]  0
         vmop                     [threads: total initially_running
wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
137571.641: no vm operation                  [     483
0            1   ]    [     0     0     42      1      0 ]  0
         vmop                     [threads: total initially_running
wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
137575.703: no vm operation                  [     483
0            1   ]    [     0     0     25      1      0 ]  0

If SafepointStatistics are printed before the Application Stop times, then
it seems that the RevokeBias was the cause of the pause.
If SafepointStatistics are printed after the Application Stop times, then
it seems that the Deoptimize was the cause of the pause.
In addition, I see a strange dependency failed error relating to JMX in the
JVM stdout log file.

thanks....


On Wed, Jan 29, 2014 at 4:44 PM, Benedict Elliott Smith <
belliottsmith@datastax.com> wrote:

> Add some more flags: -XX:+UnlockDiagnosticVMOptions -XX:LogFile=${path}
> -XX:+LogVMOutput
>
> I never figured out what kills stdout for C*. It's a library we depend on,
> didn't try too hard to figure out which one.
>
>
> On 29 January 2014 21:07, Frank Ng <fntemk@gmail.com> wrote:
>
>> Benedict,
>> Thanks for the advice.  I've tried turning on PrintSafepointStatistics.
>> However, that info is only sent to the STDOUT console.  The cassandra
>> startup script closes the STDOUT when it finishes, so nothing is shown for
>> safepoint statistics once it's done starting up.  Do you know how to
>> startup cassandra and send all stdout to a log file and tell cassandra not
>> to close stdout?
>>
>> Also, we have swap turned off as recommended.
>>
>> thanks
>>
>>
>> On Wed, Jan 29, 2014 at 3:39 PM, Benedict Elliott Smith <
>> belliottsmith@datastax.com> wrote:
>>
>>> Frank,
>>>
>>>
>>> The same advice for investigating holds: add the VM flags -XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1   (you could put something above 1 there, to reduce the
amount of logging, since a pause of 52s will be pretty obvious even if aggregated with lots
of other safe points; the count is the number of safepoints to aggregate into one log message)
>>>
>>>
>>> 52s is a very extreme pause, and I would be surprised if revoke bias could cause
this. I wonder if the VM is swapping out.
>>>
>>>
>>>
>>> On 29 January 2014 19:02, Frank Ng <fntemk@gmail.com> wrote:
>>>
>>>> Thanks for the update.  Our logs indicated that there were 0 pending
>>>> for CompactionManager at that time.  Also, there were no nodetool repairs
>>>> running at that time.  The log statements above state that the application
>>>> had to stop to reach a safepoint.  Yet, it doesn't say what is requesting
>>>> the safepoint.
>>>>
>>>>
>>>> On Wed, Jan 29, 2014 at 1:20 PM, Shao-Chuan Wang <
>>>> shaochuan.wang@bloomreach.com> wrote:
>>>>
>>>>> We had similar latency spikes when pending compactions can't keep it
>>>>> up or repair/streaming taking too much cycles.
>>>>>
>>>>>
>>>>> On Wed, Jan 29, 2014 at 10:07 AM, Frank Ng <fntemk@gmail.com> wrote:
>>>>>
>>>>>> All,
>>>>>>
>>>>>> We've been having intermittent long application pauses (version
>>>>>> 1.2.8) and not sure if it's a cassandra bug.  During these pauses,
there
>>>>>> are dropped messages in the cassandra log file along with the node
seeing
>>>>>> other nodes as down.  We've turned on gc logging and the following
is an
>>>>>> example of a long "stopped" or pause event in the gc.log file.
>>>>>>
>>>>>> 2014-01-28T23:11:12.183-0500: 1337654.424: Total time for which
>>>>>> application threads were stopped: 0.091450 seconds
>>>>>> 2014-01-28T23:14:11.161-0500: 1337833.401: Total time for which
>>>>>> application threads were stopped: 51.8190260 seconds
>>>>>> 2014-01-28T23:14:19.870-0500: 1337842.111: Total time for which
>>>>>> application threads were stopped: 0.005470 seconds
>>>>>>
>>>>>> As seen above, there was a 0.091450 secs pause, then a 51.8190260
>>>>>> secs pause.  There were no GC log events between those 2 log statements.
>>>>>> Since there's no GC logs in between, something else must be causing
the
>>>>>> long stop time to reach a safepoint.
>>>>>>
>>>>>> Could there be a Cassandra thread that is taking a long time to reach
>>>>>> a safepoint and what is it trying to do? Along with the node seeing
other
>>>>>> nodes as down in the cassandra log file, the StatusLogger shows 1599
>>>>>> Pending in ReadStage and 9 Pending in MutationStage.
>>>>>>
>>>>>> There is mention of cassandra batch revoke bias locks as a possible
>>>>>> cause (not GC) via:
>>>>>> http://www.mail-archive.com/user@cassandra.apache.org/msg34401.html
>>>>>>
>>>>>> We have JNA, no swap, and the cluster runs fine besides there
>>>>>> intermittent long pause that can cause a node to appear down to other
>>>>>> nodes.  Any ideas as the cause of the long pause above? It seems
not
>>>>>> related to GC.
>>>>>>
>>>>>> thanks.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message