cassandra-user mailing list archives

From graham sanderson <gra...@vast.com>
Subject Re: Nodes get stuck in crazy GC loop after some time, leading to timeouts
Date Sat, 29 Nov 2014 00:56:22 GMT
I should note that the young gen size is just a tuning suggestion, not directly related to
your problem at hand.
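
For what it’s worth, a minimal sketch of sizing the new gen explicitly in cassandra-env.sh (the values here are placeholders, not recommendations; pick them for your own heap and workload):

    MAX_HEAP_SIZE="8G"      # whatever you already run; 8G is just an example
    HEAP_NEWSIZE="512M"     # passed to the JVM as -Xmn512M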

You might want to make sure you don’t have issues with key/row cache.
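
A quick sanity check (assuming nodetool on 1.2/2.0; the exact field names vary a little by version):

    nodetool info

and look at the Key Cache / Row Cache lines for size vs. capacity and the recent hit rate; a large on-heap row cache with a poor hit rate is a lot of old gen pressure for very little benefit.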

Also, I’m assuming that your extra load isn’t hitting tables that you wouldn’t normally
be hitting.

> On Nov 28, 2014, at 6:54 PM, graham sanderson <graham@vast.com> wrote:
> 
> Your GC settings would be helpful, though you can guesstimate by eyeballing (assuming settings are the same across all 4 images).
> 
> Bursty load can be a big cause of old gen fragmentation (as small working-set objects tend to get spilled (promoted) along with memtable slabs which aren’t flushed quickly enough). That said, empty fragmentation holes wouldn’t show up as “used” in your graph, and it clearly looks like you are above your CMSInitiatingOccupancyFraction and CMS is running continuously, so they probably aren’t the issue here.
> 
> Other than trying a slightly larger heap to give you more head room, I’d also suggest from eyeballing that you have probably let the JVM pick its own new gen size, and I’d suggest it is too small. What to set it to really depends on your workload, but you could try something in the 0.5 gig range, unless that makes your young gen pauses too long. In that case (or indeed anyway), make sure you also have the latest GC settings (e.g. -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways) on newer JVMs, to help the young gc pauses.
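> 
> As a rough sketch (flag names apply to the CMS collector on newer Oracle/OpenJDK 7u JVMs; the numbers are starting points, not recommendations), the relevant block of cassandra-env.sh might look like:
> 
>     JVM_OPTS="$JVM_OPTS -Xmn512M"
>     JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
>     JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
>     JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>     JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
>     JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"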
> 
>> On Nov 28, 2014, at 2:55 PM, Paulo Ricardo Motta Gomes <paulo.motta@chaordicsystems.com> wrote:
>> 
>> Hello,
>> 
>> This is a recurrent behavior of JVM GC in Cassandra that I never completely understood: when a node is UP for many days (or even months), or receives a very high load spike (3x-5x normal load), CMS GC pauses start becoming very frequent and slow, causing periodic timeouts in Cassandra. Trying to run GC manually doesn't free up memory. The only solution when a node reaches this state is to restart it.
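>> 
>> (For concreteness, "run GC manually" here means forcing a full collection from the shell on a HotSpot JDK 7, e.g.:
>> 
>>     jcmd <cassandra-pid> GC.run
>> 
>> or jmap -histo:live <pid>, which triggers one as a side effect; on a sick node neither reclaims any significant memory.)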
>> 
>> We restart the whole cluster every 1 or 2 months to avoid machines getting into this crazy state. We tried tuning GC size and parameters, and different Cassandra versions (1.1, 1.2, 2.0), but this behavior keeps happening. More recently, during Black Friday, we received about 5x our normal load, and some machines started presenting this behavior. Once again, we restarted the nodes and the GC behaved normally again.
>> 
>> I'm attaching a few pictures comparing the heap of "healthy" and "sick" nodes: http://imgur.com/a/Tcr3w
>> 
>> You can clearly notice some memory is actually reclaimed during GC in healthy nodes, while in sick machines very little memory is reclaimed. Also, since GC is executed more frequently in sick machines, it uses about 2x more CPU than healthy nodes.
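>> 
>> (The reclaim per collection is easy to quantify with jstat, e.g.:
>> 
>>     jstat -gcutil <cassandra-pid> 5000
>> 
>> On a healthy node the old gen column, O, drops after each CMS cycle; on a sick node it stays pinned near 100% while the FGC count keeps climbing.)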
>> 
>> Have you ever observed this behavior in your cluster? Could this be related to heap fragmentation? Would using the G1 collector help in this case? Any GC tuning or monitoring advice to troubleshoot this issue?
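>> 
>> (For reference, the minimal G1 experiment would be to replace the CMS/ParNew flags in cassandra-env.sh with something like the following, where the pause target is just an arbitrary starting point:
>> 
>>     JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
>>     JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=200"
>> 
>> and compare the same heap graphs.)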
>> 
>> Any advice or pointers will be kindly appreciated.
>> 
>> Cheers,
>> 
>> -- 
>> Paulo Motta
>> 
>> Chaordic | Platform
>> www.chaordic.com.br
>> +55 48 3232.3200
> 

