cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sylvain Lebresne <sylv...@datastax.com>
Subject Re: Intermittent long application pauses on nodes
Date Thu, 30 Jan 2014 10:35:04 GMT
>
>
> I never figured out what kills stdout for C*. It's a library we depend on,
> didn't try too hard to figure out which one.
>

Nah, it's Cassandra itself (in
org.apache.cassandra.service.CassandraDaemon.activate()), but you can pass
-f (for 'foreground') to not do it.


>
>
> On 29 January 2014 21:07, Frank Ng <fntemk@gmail.com> wrote:
>
>> Benedict,
>> Thanks for the advice.  I've tried turning on PrintSafepointStatistics.
>> However, that info is only sent to the STDOUT console.  The cassandra
>> startup script closes the STDOUT when it finishes, so nothing is shown for
>> safepoint statistics once it's done starting up.  Do you know how to
>> startup cassandra and send all stdout to a log file and tell cassandra not
>> to close stdout?
>>
>> Also, we have swap turned off as recommended.
>>
>> thanks
>>
>>
>> On Wed, Jan 29, 2014 at 3:39 PM, Benedict Elliott Smith <
>> belliottsmith@datastax.com> wrote:
>>
>>> Frank,
>>>
>>>
>>> The same advice for investigating holds: add the VM flags -XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1   (you could put something above 1 there, to reduce the
amount of logging, since a pause of 52s will be pretty obvious even if aggregated with lots
of other safe points; the count is the number of safepoints to aggregate into one log message)
>>>
>>>
>>> 52s is a very extreme pause, and I would be surprised if revoke bias could cause
this. I wonder if the VM is swapping out.
>>>
>>>
>>>
>>> On 29 January 2014 19:02, Frank Ng <fntemk@gmail.com> wrote:
>>>
>>>> Thanks for the update.  Our logs indicated that there were 0 pending
>>>> for CompactionManager at that time.  Also, there were no nodetool repairs
>>>> running at that time.  The log statements above state that the application
>>>> had to stop to reach a safepoint.  Yet, it doesn't say what is requesting
>>>> the safepoint.
>>>>
>>>>
>>>> On Wed, Jan 29, 2014 at 1:20 PM, Shao-Chuan Wang <
>>>> shaochuan.wang@bloomreach.com> wrote:
>>>>
>>>>> We had similar latency spikes when pending compactions can't keep it
>>>>> up or repair/streaming taking too much cycles.
>>>>>
>>>>>
>>>>> On Wed, Jan 29, 2014 at 10:07 AM, Frank Ng <fntemk@gmail.com> wrote:
>>>>>
>>>>>> All,
>>>>>>
>>>>>> We've been having intermittent long application pauses (version
>>>>>> 1.2.8) and not sure if it's a cassandra bug.  During these pauses,
there
>>>>>> are dropped messages in the cassandra log file along with the node
seeing
>>>>>> other nodes as down.  We've turned on gc logging and the following
is an
>>>>>> example of a long "stopped" or pause event in the gc.log file.
>>>>>>
>>>>>> 2014-01-28T23:11:12.183-0500: 1337654.424: Total time for which
>>>>>> application threads were stopped: 0.091450 seconds
>>>>>> 2014-01-28T23:14:11.161-0500: 1337833.401: Total time for which
>>>>>> application threads were stopped: 51.8190260 seconds
>>>>>> 2014-01-28T23:14:19.870-0500: 1337842.111: Total time for which
>>>>>> application threads were stopped: 0.005470 seconds
>>>>>>
>>>>>> As seen above, there was a 0.091450 secs pause, then a 51.8190260
>>>>>> secs pause.  There were no GC log events between those 2 log statements.
>>>>>> Since there's no GC logs in between, something else must be causing
the
>>>>>> long stop time to reach a safepoint.
>>>>>>
>>>>>> Could there be a Cassandra thread that is taking a long time to reach
>>>>>> a safepoint and what is it trying to do? Along with the node seeing
other
>>>>>> nodes as down in the cassandra log file, the StatusLogger shows 1599
>>>>>> Pending in ReadStage and 9 Pending in MutationStage.
>>>>>>
>>>>>> There is mention of cassandra batch revoke bias locks as a possible
>>>>>> cause (not GC) via:
>>>>>> http://www.mail-archive.com/user@cassandra.apache.org/msg34401.html
>>>>>>
>>>>>> We have JNA, no swap, and the cluster runs fine besides there
>>>>>> intermittent long pause that can cause a node to appear down to other
>>>>>> nodes.  Any ideas as the cause of the long pause above? It seems
not
>>>>>> related to GC.
>>>>>>
>>>>>> thanks.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message