cassandra-user mailing list archives

From: Kai Wang <dep...@gmail.com>
Subject: Re: GC and compaction behaviour in a multi-DC environment
Date: Tue, 15 Dec 2015 20:06:44 GMT
Check MaxTenuringThreshold in your cassandra-env.sh. If that threshold is
too low, objects will be promoted to the old generation too quickly.
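
For reference, if I remember correctly the stock cassandra-env.sh in 2.0 sets
these CMS defaults (a threshold of 1 means objects surviving a single young-gen
collection are promoted; raising it keeps short-lived objects in the survivor
spaces longer before they reach the old generation):

    # default GC tuning in conf/cassandra-env.sh (2.0)
    JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
    JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"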

I am a little confused by your GC numbers on DC1. If DC1 exceeds the 200ms
GC threshold fewer than 10 times in 4 days, how can its average GC duration
be 400ms? Did I miss anything here?

On Tue, Dec 15, 2015 at 6:09 AM, Vasileios Vlachos <vasileiosvlachos@gmail.com> wrote:

> Hello,
>
> We are running Cassandra 2.0.16 across 2 DCs at the moment, each of which
> has 4 nodes. Replication factor is 3 for all KS and all applications
> write/read using LOCAL_QUORUM. So, if DC1 is what's regarded as "local",
> then DC2 gets all writes asynchronously. Nothing writes directly to DC2;
> traffic flows only from clients -> DC1 -> DC2. Cassandra runs on physical
> servers at DC1, whereas at DC2 it runs on virtual machines (we use VMware
> ESXi 5.1). Both physical and virtual servers, however, have the same amount of
> resources available (cpu/memory/disks etc). All boxes have 16G of RAM.
> MAX_HEAP_SIZE is 8G and HEAP_NEWSIZE is 600M. We use the default CMS; we
> haven't switched to G1.
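>
> (For reference, those are just the stock cassandra-env.sh variables, which
> the script passes through to the JVM as heap flags, roughly:)
>
>     MAX_HEAP_SIZE="8G"
>     HEAP_NEWSIZE="600M"
>     # ends up on the command line as approximately:
>     # -Xms8G -Xmx8G -Xmn600M, plus the default CMS options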
>
> *Observations:*
>
> 1. GC pauses on DC2 appear to be significantly more frequent. On DC1 they
> rarely exceed the 200ms threshold that makes them appear in the logs. To
> give you some numbers:
>
> DC1:
>     node1: 2 pauses logged over the past 4 days
>     node2: 5
>     node3: 2
>     node4: 9
>
> DC2:
>     node1: 2475 pauses logged over the past 4 days
>     node2: 3478
>     node3: 3817
>     node4: 2472
>
> GC pause duration varies; for DC1 it is around 400ms, and for DC2 the average
> is about 400ms as well, but there are several pauses which exceed 1 or even
> 2 seconds.
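>
> (The counts above are simply the GCInspector entries in system.log over
> that window; something like the following gives the per-node figure,
> assuming the default log location:)
>
>     grep -c "GCInspector" /var/log/cassandra/system.log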
>
> *DC1 graphs:*
>
> [DC1 - node1 graph]
> [DC1 - node2 graph]
> [DC1 - node3 graph]
> [DC1 - node4 graph]
>
> *DC2 graphs:*
>
> [DC2 - node1 graph]
> [DC2 - node2 graph]
> [DC2 - node3 graph]
> [DC2 - node4 graph]
>
> The low utilisation of the survivor spaces (presented as a "gap" in the
> graphs above, cassandra03 graph for example) correlates with compaction
> activity on the same box:
>
> [compaction activity graph for the same node]
>
> 2. ~10 *Data.db files per KS on DC1 nodes, ~15 *Data.db files per KS on
> DC2 nodes
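>
> (Counted as the *-Data.db files under each keyspace's data directory,
> e.g. with something like the following, assuming the default data
> location:)
>
>     ls /var/lib/cassandra/data/<keyspace>/*/*-Data.db | wc -l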
>
> 3. We are aware of CASSANDRA-9662 (thanks to this list!), but another
> observation is that our monitoring system seems to be reporting thousands
> of compactions much more frequently on DC2:
>
> [DC1 Compaction Activity graph]
>
> [DC2 Compaction Activity graph]
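>
> (For what it's worth, the active/pending compactions on a node can be
> cross-checked directly with nodetool, e.g.:)
>
>     nodetool compactionstats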
>
> *Questions:*
>
> 1. I am not certain that the heap utilisation seen in the graphs is
> healthy. I'd appreciate your input on this. In addition, is it normal for
> DC2 to show such a huge difference in GC activity compared to DC1? Also,
> I'm not sure whether the low utilisation of the survivor spaces seen above
> is expected during GC activity. How do these two things (GC and
> compactions) relate? What makes the old generation keep growing when the
> survivors are underutilised (~5%)?
>
> 2. Compaction activity (as in frequency of compactions, not number of...)
> seems to be comparable for both DCs (with the exception of point 3 above),
> so I'm not sure why the number of Data.db files is consistently higher in
> DC2. Is this something important, or is it a minor detail I shouldn't care
> about?
>
> 3. I'm not sure I have a question regarding observation #3, because I'm
> going to upgrade to 2.0.17 (at least), but I just included it here in case
> it helps with the first two observations.
>
> Thanks in advance for any help!
>
