cassandra-user mailing list archives

From graham sanderson <gra...@vast.com>
Subject Re: GC pauses affecting entire cluster.
Date Mon, 01 Jun 2015 21:29:03 GMT
Yes, native_objects is the way to go… you can tell if memtables are your problem because you'll
see promotion failures of objects sized 131074 dwords.
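
If you have GC logging enabled, a quick way to check is something like the sketch below (the log
path and exact message text are assumptions; they vary by JVM version and setup):

# assumes -XX:+PrintPromotionFailure and -Xloggc:<path> are already set in cassandra-env.sh
# (the log path below is just an example)
grep "promotion failure size = 131074" /var/log/cassandra/gc.log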

If your h/w is fast enough, make your young gen as big as possible - we can always collect 8G in
under a second, and this gives you the best chance of transient objects (especially if you
still have thrift clients) dying before they leak into the old gen. Moving from 2.0.x to 2.1.x
(and off-heap memtables), we have reduced our old gen from 16gig to 12gig and will keep shrinking
it, but have had no promotion failures yet, and it's been several months.
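
In cassandra-env.sh terms, that shape looks roughly like this (illustrative numbers only, sized
for our hardware; don't copy them blindly):

MAX_HEAP_SIZE="20G"   # leaves ~12G of old gen after the young gen below
HEAP_NEWSIZE="8G"     # large young gen; we still see sub-second ParNew collections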

Note we are running a patched 2.1.3, but 2.1.5 has the same important bugs fixed (bugs that
might have been giving you memory issues).

> On Jun 1, 2015, at 3:00 PM, Carl Hu <me@carlhu.com> wrote:
> 
> Thank you for the suggestion. After analyzing your settings, the basic hypothesis here is that objects are promoted to Old Gen very quickly because of a rapid accumulation of heap usage due to memtables. We happen to be running on 2.1, and I thought a more conservative approach than your (quite aggressive) gc settings is to try the new memtable_allocation_type with offheap_objects and see if the memtable pressure is relieved enough that the standard gc settings can keep up.
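
(For reference, the setting being tried here lives in cassandra.yaml on 2.1; a minimal sketch:)

# cassandra.yaml (2.1)
memtable_allocation_type: offheap_objects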
> 
> The experiment is in progress and I will report back with the results.
> 
> On Mon, Jun 1, 2015 at 10:20 AM, Anuj Wadehra <anujw_2003@yahoo.co.in> wrote:
> We have a write-heavy workload and used to face promotion failures/long gc pauses with Cassandra 2.0.x. I am not into the code yet, but I think that memtable- and compaction-related objects are mid-lived, and a write-heavy workload is not well suited to generational collection with the default settings. So we tuned the JVM to make sure that as few objects as possible are promoted to Old Gen, and achieved great success with that:
> MAX_HEAP_SIZE="12G"
> HEAP_NEWSIZE="3G"
> -XX:SurvivorRatio=2
> -XX:MaxTenuringThreshold=20
> -XX:CMSInitiatingOccupancyFraction=70
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=20"
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
> JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=30000"
> JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=2000"
> JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
> We also think that the default total_memtable_space_in_mb = 1/4 heap is too much for write-heavy loads (by default, young gen is also 1/4 heap). We reduced the memtable space to 1000mb in order to make sure that memtable-related objects don't stay in memory for too long. Combining this with SurvivorRatio=2 and MaxTenuringThreshold=20 did the job well: GC was very consistent, and no Full GC was observed.
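
(For reference, the memtable cap described above lives in cassandra.yaml; a sketch, assuming the
2.0.x option name:)

# cassandra.yaml (2.0.x)
memtable_total_space_in_mb: 1000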
> 
> Environment: 3-node cluster, each node having 24 cores, 64G RAM, and SSDs in RAID5.
> We are doing around 12k writes/sec across 5 CFs (one with 4 secondary indexes) and 2300 reads/sec on each node of the 3-node cluster. 2 CFs have wide rows with a max of around 100mb of data per row.
> 
> Yes, a node being marked down has a cascading effect: within seconds, all nodes in our cluster
are marked down.
> 
> Thanks
> Anuj Wadehra
> 
> 
> 
> On Monday, 1 June 2015 7:12 PM, Carl Hu <me@carlhu.com> wrote:
> 
> 
> We are running Cassandra version 2.1.5.469 on 15 nodes and are experiencing a problem where the entire cluster slows down for 2.5 minutes when one node experiences a 17-second stop-the-world gc. These gcs happen once every 2 hours. I did find a ticket that seems related to this, https://issues.apache.org/jira/browse/CASSANDRA-3853, but Jonathan Ellis has resolved that ticket.
> 
> We are running standard gc settings, but my concern here is not so much the 17-second gc on a single node (after all, we have 14 others) as the cascading performance problem.
> 
> We are running the standard values of dynamic_snitch_badness_threshold (0.1) and phi_convict_threshold (8). (These values are relevant because the dynamic snitch could route requests away from the frozen node, or the failure detector could mark the node as 'down'.)
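
(Both are cassandra.yaml settings; a sketch of the defaults in question:)

# cassandra.yaml defaults
dynamic_snitch_badness_threshold: 0.1
phi_convict_threshold: 8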
> 
> We use the python client in its default round-robin mode, so every client hits the coordinators on all nodes in round robin. One theory is that since the coordinator on every node must hit the frozen node at some point during the 17 seconds, each node's request queues fill up and the entire cluster thus freezes. That would explain a 17-second freeze but would not explain the 2.5-minute slowdown (10x increase in request latency @ P50).
> 
> I'd love your thoughts. I've provided the GC chart here.
> 
> Carl
> 
> [attachment: GC chart - d2c95dce-0848-11e5-91f7-6b223349fc14.png]
> 
> 
> 

