cassandra-user mailing list archives

From Vasileios Vlachos <>
Subject GC and compaction behaviour in a multi-DC environment
Date Tue, 15 Dec 2015 11:09:46 GMT

We are running Cassandra 2.0.16 across 2 DCs at the moment, each of which
has 4 nodes. The replication factor is 3 for all keyspaces and all applications
write/read using LOCAL_QUORUM. So, if DC1 is what's regarded as "local", then
DC2 gets all writes asynchronously. Nothing writes directly to DC2; traffic
flows only from clients -> DC1 -> DC2. Cassandra runs on physical servers
at DC1, whereas at DC2 it runs on virtual machines (VMware ESXi 5.1).
Both physical and virtual servers, however, have the same amount of
resources available (CPU/memory/disks, etc.). All boxes have 16G of RAM.
MAX_HEAP_SIZE is 8G and HEAP_NEWSIZE is 600M. We use the default CMS
collector; we haven't switched to G1.
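
Apart from the heap sizes we are, as far as I know, on the stock
cassandra-env.sh GC settings for 2.0; roughly the following (reproduced
from memory, so the exact flags may differ slightly):

    MAX_HEAP_SIZE="8G"
    HEAP_NEWSIZE="600M"
    # default CMS/ParNew flags shipped with 2.0
    JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
    JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
    JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
    JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
    JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"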


1. GC pauses on DC2 appear to be significantly more frequent. On
DC1 they rarely exceed the 200ms threshold that makes them appear in the
logs. To give you some numbers:

    DC1:
    node1: 2 pauses logged over the past 4 days
    node2: 5
    node3: 2
    node4: 9

    DC2:
    node1: 2475 pauses logged over the past 4 days
    node2: 3478
    node3: 3817
    node4: 2472

GC pause duration varies: for DC1 it is around 400ms; for DC2 the average is
about 400ms as well, but there are several pauses that exceed 1 or even 2 seconds.
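
In case the methodology matters: the pause counts above are just a rough
grep of the GCInspector lines in system.log on each node (default log
location and layout assumed), along these lines:

    # count GC pauses long enough (>200ms) to be logged by GCInspector
    grep -c 'GCInspector' /var/log/cassandra/system.log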

*DC1 graphs:*

[DC1 - node1 graph]
[DC1 - node2 graph]
[DC1 - node3 graph]
[DC1 - node4 graph]

*DC2 graphs:*

[DC2 - node1 graph]
[DC2 - node2 graph]
[DC2 - node3 graph]
[DC2 - node4 graph]

The low utilisation of the survivor spaces (appearing as a "gap" in the
graphs above, the cassandra03 graph for example) correlates with compaction
activity on the same box:

[cassandra03 compaction activity graph]
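
If anyone wants to reproduce the survivor-space picture without our
monitoring system, watching the node live with jstat should show the same
pattern (the pid lookup here is just an example):

    # S0/S1 = survivor spaces, E = eden, O = old gen, sampled every second
    jstat -gcutil $(pgrep -f CassandraDaemon) 1000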

2. There are ~10 *Data.db files per keyspace on DC1 nodes and ~15 *Data.db
files per keyspace on DC2 nodes.
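
The counts can be reproduced with something along these lines on each node
(default data path assumed; adjust to taste):

    # number of live SSTables (Data.db files) per keyspace
    for ks in /var/lib/cassandra/data/*/; do
        echo -n "$ks: "; find "$ks" -name '*Data.db' | wc -l
    done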

3. We are aware of CASSANDRA-9662 (thanks to this list!), but another
observation is that our monitoring system seems to report compaction counts
in the thousands more frequently on DC2:

DC1 Compaction Activity:

[DC1 compaction activity graph]

DC2 Compaction Activity:

[DC2 compaction activity graph]
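
For reference, the monitoring numbers can be cross-checked against nodetool
on the individual nodes, e.g.:

    # currently running and pending compactions on this node
    nodetool compactionstats
    # thread pool stats, including the CompactionExecutor pool
    nodetool tpstats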


1. I am not certain that the heap utilisation seen in the graphs is
healthy; I'd appreciate your input on this. In addition, is it normal for
DC2 to show such a huge difference in GC activity compared to DC1?
Also, I'm not sure whether the low utilisation of the survivor spaces seen
above is expected during GC activity. How do these two things (GC and
compactions) relate? What makes the old generation keep growing when the
survivors are underutilised (~5%)?

2. Compaction activity (as in frequency of compactions, not number of...)
seems to be comparable for both DCs (with the exception of point 3 above),
so I'm not sure why the number of Data.db files is consistently higher in
DC2. Is this something important, or is it a minor detail I shouldn't care
about?

3. I'm not sure I have a question regarding observation #3, because I'm
going to upgrade to 2.0.17 (at least), but I just included it here in case
it helps with the first two observations.

Thanks in advance for any help!
