cassandra-user mailing list archives

From Vasileios Vlachos <vasileiosvlac...@gmail.com>
Subject Re: GC and compaction behaviour in a multi-DC environment
Date Thu, 17 Dec 2015 11:46:36 GMT
Thanks for your input Anuj.

Looking at the heap utilisation graph for cassandra03 at DC1 (and if I got
this right), the suspicion is that the new generation is not large enough
to keep up with the objects created by compactions and memtable flushes.
This will likely result in premature promotions, which will in turn make
the pauses more severe.

I still don't understand why the survivor utilisation drops during
compactions. Any thoughts on this? Unless it's the graph sampling rate that
skews the data: given that the sampling rate is 1 minute, it could be that
the survivors fill up so quickly that our monitoring keeps catching them at
the wrong moment. I still have doubts that this is the case, though (how
could the monitoring consistently catch them at the wrong time? Sampling is
constant, but the workload is bursty, which makes the NewGen and survivors
fill up accordingly).
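
One way I could rule out a sampling artefact (just a sketch, assuming the
JDK tools are installed on the node and <pid> is the Cassandra process ID)
would be to sample the heap at a 1-second interval while a compaction is
running:

    jstat -gcutil <pid> 1000

That prints survivor (S0/S1), eden and old gen occupancy plus GC counts
every second, so a survivor space that empties and refills between our
1-minute samples should show up clearly.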

I don't disagree, the GC pauses are not huge. There are no signs of CPU
pressure on any server either, which is why I think any pauses above 200ms
are just a sign of bad configuration (at least that's my thinking so far).
I'll investigate the suggested changes. Regarding
memtable_total_space_in_mb, we are not using SSDs, but I only expect the
disks to really struggle when compactions coincide. Since the default is a
quarter of the heap, setting MAX_HEAP_SIZE to 6G will decrease
memtable_total_space_in_mb anyway, so I might tweak this gradually/last.
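
For my own reference, a minimal sketch of the two knobs involved (the 500
value is Anuj's suggestion below; the 1/4-of-heap default is the one he
mentioned):

    # cassandra-env.sh
    MAX_HEAP_SIZE="6G"    # default memtable space then becomes ~1.5G (1/4 of heap)

    # cassandra.yaml (only if I decide to pin it explicitly later)
    # memtable_total_space_in_mb: 500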

Looking at my setup, I am not sure I can describe the workload of the two
DCs as identical. Yes, whatever writes occur on DC1 are replicated to DC2
(eventually/hopefully), but I'm not sure this puts the same pressure on
both ends. If anything, I'd expect DC2 to be under less pressure. That
makes me think something is wrong, given the graphs and the GC numbers I
provided in the first email.

Thanks for the link regarding the recommended settings; the only things
that are different are:

1. In /etc/security/limits.d/cassandra.conf:
    cassandra  -  nproc    8096  (instead of 32768)
2. sudo blockdev --report /dev/<device> gives 256 (instead of 128; but it's
not the 65536 that the link explicitly says to avoid)

Everything else is as recommended. Not sure whether nproc or the readahead
(RA) value is likely to be the cause?
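
If it turns out they matter, the fixes would be along these lines (the
device name is a placeholder and 128 is the value the DataStax page
recommends):

    # /etc/security/limits.d/cassandra.conf
    cassandra  -  nproc  32768

    # readahead: 256 -> 128 sectors
    sudo blockdev --setra 128 /dev/<device>
    sudo blockdev --report /dev/<device>    # verify

but I'll hold off on changing anything until the GC picture is clearer.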

Thanks,
Vasilis

On Wed, Dec 16, 2015 at 5:33 PM, Anuj Wadehra <anujw_2003@yahoo.co.in>
wrote:

> Hi Vasileios,
>
> My comments:
>
> 1. I am not certain that Heap utilisation as seen from the graphs is
> healthy.
>
> Also, I'm not sure if the low utilisation of the survivor spaces as seen
> above is expected or not during GC activity. How do these two things relate
> (GC - compactions)?
>
>  What makes the old generation keep increasing when the survivors are
> underutilised (~5%)?
>
> Anuj: I think your new gen and tenuring threshold are too small. Memtable
> and compaction objects may move to the old gen too quickly because the
> survivor spaces don't have enough room. By default, memtable_total_space_in_mb
> is 1/4 of the heap, which means 2 GB for an 8 GB heap. Moreover, an 8 GB
> heap allocation on a 16 GB system is high.
>
> Even though the GC pauses are not huge, I would suggest you try the
> following settings:
>
> memtable_total_space_in_mb=500
> The above setting will lead to more I/O, but on SSDs that's OK.
>
> MAX_HEAP_SIZE="6G"
> HEAP_NEWSIZE="1200MB"
> -XX:SurvivorRatio=2
> -XX:MaxTenuringThreshold=8
> -XX:CMSInitiatingOccupancyFraction=50
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=16384"
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
> JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=30000"
> JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
>
> I'd appreciate your input on this. In addition, is it normal for DC2 to
> have such a huge difference in GC activity in comparison to DC1?
>
> Anuj: I don't think that major GC differences are possible with the same
> setup and workload. Make sure you follow all the DataStax recommended
> settings at:
> https://docs.datastax.com/en/cassandra/2.0/cassandra/install/installRecommendSettings.html
>
> 2. Compaction activity (as in frequency of compactions, not number of...)
> seems to be comparable for both DCs (with the exception of point 3 above),
> so I'm not sure why the number of Data.db files is consistently higher in
> DC2. Is this something important, or is it a minor detail I shouldn't care
> about?
>
> Anuj: 10 vs. 15 is not significant. You can ignore it.
>
> Thanks
>
> Anuj
> ------------------------------
> *From*:"Vasileios Vlachos" <vasileiosvlachos@gmail.com>
> *Date*:Wed, 16 Dec, 2015 at 10:16 am
> *Subject*:Re: GC and compaction behaviour in a multi-DC environment
>
> Thanks for your reply,
>
> Apologies, I didn't explain that properly... When I say average, I mean
> the average of the 10 samples that appear in the logs, not the average
> across all GCs that happen over time. And the same applies to both DCs.
>
> The tenuring threshold has been left at the default value, which, if I
> remember correctly, is 1 (I'll check again tomorrow). I was hoping the new
> generation was large enough for the tenuring threshold not to be an issue.
>
> But from all the searching that I've done, when people have GC issues,
> their graphs make it obvious that there is something wrong. My problem is
> that I have too many unknowns at the moment to conclude that there is
> something wrong. All I've done is report what I see and keep on
> investigating whilst asking for some input here.
>
> On Tue, Dec 15, 2015 at 8:06 PM, Kai Wang <depend@gmail.com> wrote:
>
>> Check MaxTenuringThreshold in your cassandra-env.sh. If that threshold is
>> too low, objects will be moved to old gen too quickly.
>>
>> I am a little confused by your GC numbers on DC1. If DC1 only exceeded the
>> 200ms GC threshold fewer than 10 times in 4 days, how can its average GC
>> duration be 400ms? Did I miss anything here?
>>
>> On Tue, Dec 15, 2015 at 6:09 AM, Vasileios Vlachos <
>> vasileiosvlachos@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> We are running Cassandra 2.0.16 across 2 DCs at the moment, each of
>>> which has 4 nodes. Replication factor is 3 for all KS and all applications
>>> write/read using LOCAL_QUORUM. So, if DC1 is what's regarded as "local",
>>> then DC2 gets all writes asynchronously. Nothing writes directly to DC2;
>>> traffic flows only from Clients -> DC1 -> DC2. Cassandra runs on physical
>>> servers at DC1, whereas at DC2 it runs on virtual machines (we use VMWare ESXi
>>> 5.1). Both physical and virtual servers, however, have the same amount of
>>> resources available (cpu/memory/disks etc). All boxes have 16G of RAM.
>>> MAX_HEAP_SIZE is 8G and HEAP_NEWSIZE is 600M. We use the default CMS;
>>> we haven't switched to G1.
>>>
>>> *Observations:*
>>>
>>> 1. GC pauses on DC2 appear to be significantly more frequent.
>>> On DC1 they rarely exceed the 200ms threshold that makes them appear in the
>>> logs. To give you some numbers:
>>>
>>> DC1:
>>>     node1: 2 pauses logged over the past 4 days
>>>     node2: 5
>>>     node3: 2
>>>     node4: 9
>>>
>>> DC2:
>>>     node1: 2475 pauses logged over the past 4 days
>>>     node2: 3478
>>>     node3: 3817
>>>     node4: 2472
>>>
>>> GC pause duration varies; for DC1 it is around 400ms, and for DC2 the
>>> average is about 400ms as well, but there are several pauses that exceed
>>> 1 or even 2 seconds.
>>>
>>> *DC1 graphs:*
>>>
>>> [DC1 - node1 graph: inline image]
>>> [DC1 - node2 graph: inline image]
>>> [DC1 - node3 graph: inline image]
>>> [DC1 - node4 graph: inline image]
>>>
>>> *DC2 graphs:*
>>>
>>> [DC2 - node1 graph: inline image]
>>> [DC2 - node2 graph: inline image]
>>> [DC2 - node3 graph: inline image]
>>> [DC2 - node4 graph: inline image]
>>>
>>> The low utilisation of the survivor spaces (presented as a "gap" in the
>>> graphs above, cassandra03 graph for example) correlates with compaction
>>> activity on the same box:
>>>
>>> [image: Inline image 10]
>>>
>>> 2. ~10 *Data.db files per KS on DC1 nodes, ~15 *Data.db files per KS on
>>> DC2 nodes
>>>
>>> 3. We are aware of CASSANDRA-9662 (thanks to this list!), but another
>>> observation is that our monitoring system on DC2 seems to be reporting
>>> thousands of compactions more frequently:
>>>
>>> DC1 Compaction Activity:
>>>
>>> [image: Inline image 11]
>>>
>>> DC2 Compaction Activity:
>>>
>>> [image: Inline image 12]
>>>
>>> *Questions:*
>>>
>>> 1. I am not certain that Heap utilisation as seen from the graphs is
>>> healthy. I'd appreciate your input on this. In addition, is it normal for
>>> DC2 to have such a huge difference in GC activity in comparison to DC1?
>>> Also, I'm not sure if the low utilisation of the survivor spaces as seen
>>> above is expected or not during GC activity. How do these two things relate
>>> (GC - compactions)? What makes the old generation keep increasing when the
>>> survivors are underutilised (~5%)?
>>>
>>> 2. Compaction activity (as in frequency of compactions, not number
>>> of...) seems to be comparable for both DCs (with the exception of point 3
>>> above), so I'm not sure why the number of Data.db files is consistently
>>> higher in DC2. Is this something important, or is it a minor detail I
>>> shouldn't care about?
>>>
>>> 3. I'm not sure I have a question regarding observation #3, because I'm
>>> going to upgrade to 2.0.17 (at least), but I just included it here in case
>>> it helps with the first two observations.
>>>
>>> Thanks in advance for any help!
>>>
>>
>>
>
