cassandra-user mailing list archives

From Alain RODRIGUEZ <arodr...@gmail.com>
Subject Re: determining the cause of a high CPU / disk util node
Date Mon, 04 Sep 2017 16:57:37 GMT
Hi Andrew,

I have seen a very similar problem in AWS, where a node would regularly
perform badly (high io-wait plus a terrible load) for no apparent reason.

The first thing would be to make sure this node is not under a specifically
high load for some good reason: an imbalanced (and really high) data load,
bad configuration, reading tombstones, big requests (maybe due to variable
partition sizes)... This can be checked with monitoring that aggregates
stats per node, or with nodetool info. Make sure this node is behaving in a
way that is comparable to the other nodes. Due to dynamic snitching, this
node could even be receiving fewer requests than the others.
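
For instance, a minimal sketch of that comparison (run against each node;
the keyspace name is a placeholder):

    # Data load and effective ownership per node across the ring
    nodetool status my_keyspace

    # Per-node summary: data load, heap usage, cache hit rates, uptime
    nodetool info

    # Pending/blocked thread pool tasks, a quick sign a node is falling behind
    nodetool tpstats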

Another possibility is that the hardware is somehow broken; from time to
time, a server is simply unhealthy on AWS :-). Maybe the disk this instance
relies on is failing.
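
A quick, generic way to look for that (a sketch; on virtualised or EBS
disks smartctl often reports nothing useful, and the device name below is
just an example):

    # Any I/O errors, resets or remounts logged by the kernel?
    dmesg -T | grep -iE 'i/o error|blk_update|remount|nvme|xvd'

    # Physical disk health, where the hypervisor exposes it
    sudo smartctl -a /dev/nvme0n1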

Then be aware of what are called "noisy neighbours". Basically, on AWS you
are probably running a VM on top of a bigger machine that is shared with
other customers, and the underlying hardware is shared even if you are
using instance stores. So if your 'neighbours' are having a fun time
processing data with Spark or something like that, your own usage can be
affected. Even if the problem only occurs from time to time, it can be the
hardware being overloaded (a deficiency from your perspective).
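
One rough way to spot that kind of contention from inside the guest (a
sketch, nothing Cassandra-specific):

    # %steal staying above a few percent means the hypervisor is giving CPU
    # time to other guests instead of this one
    mpstat 5

    # await (and %util) much higher than on sibling nodes points at slow
    # underlying storage rather than at Cassandra itself
    iostat -x 5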

A short-term fix might be to trash this instance after bringing up a
replacement node. AWS allows that, and so does Cassandra. If that's doable,
it's probably a good thing to try. In Cassandra we prefer a down node to a
slow node, so +1 on Chris's suggestion there.
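
If you go that way, the usual approach is to start the brand new node with
the replace flag pointing at the old node's address (a sketch; the IP is a
placeholder and where you set JVM options depends on your packaging, e.g.
cassandra-env.sh):

    # On the replacement node, set before the very first start
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.12"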

Also, in my case this problem vanished for good when switching to instances
with SSDs. I am not sure what you are using, but I think having more
throughput headroom protects you somewhat from this kind of issue.

As a last thought, you can have dedicated servers in AWS as well. I don't
know much about them, but I believe that would also remove this risk.

Cheers,
-----------------------
Alain Rodriguez - @arodream - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2017-09-04 17:33 GMT+01:00 Chris Lohfink <clohfink85@gmail.com>:

> nodetool cfstats will help identify any table that has super wide
> partitions, too many tombstones read per query, or an excessive number of
> sstables. Any of these can cause this, but large partitions and too many
> tombstones *usually* show up as too many GCs and the node going down/up,
> or as high user CPU rather than high iowait CPU.
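
(A quick illustration of that check with nodetool; on 3.7 cfstats and
tablestats are equivalent, and the keyspace/table names are placeholders:)

    # Partition sizes, tombstones scanned per read, and sstable counts
    nodetool tablestats my_keyspace.my_table | \
        grep -E 'Compacted partition|Average tombstones per slice|SSTable count'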
>
> Are you using EBS or ephemerals? Have you checked iostat or anything on
> the disks to make sure they are not going bad? If disk latency spikes or a
> disk goes bad, you can end up seeing a node with high iowait. If it's not
> a disk failure, you can debug further and find which files are causing
> most of the IO (e.g.
> http://bencane.com/2012/08/06/troubleshooting-high-io-wait-in-linux/ )
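
(That per-process I/O debugging can look something like the sketch below;
iotop and pidstat come from the iotop and sysstat packages, and the pgrep
pattern assumes a single Cassandra JVM on the host:)

    # Which processes are doing the most disk I/O right now?
    sudo iotop -o -b -n 3

    # Per-process read/write rates over time
    pidstat -d 5

    # Which sstable files the Cassandra process currently has open
    sudo lsof -p $(pgrep -f CassandraDaemon) | grep -i '\.db'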
>
> It might be easiest to just replace that host with a new node and see if
> that fixes it, to rule out hardware. This might also fix the problem if
> that one node is way behind on compactions, by resetting it to be in line
> with the other nodes.
>
> Chris
>
> On Sun, Sep 3, 2017 at 10:15 AM, Andrew Bialecki <
> andrew.bialecki@klaviyo.com> wrote:
>
>> Fay, what do you mean by "partition key data is on one node"? Shouldn't a
>> write request with RF=3 be fulfillable by any of the three nodes?
>>
>> I do think we have a "hot key," we're working on tracking that down.
>>
>> On Sat, Sep 2, 2017 at 11:30 PM, Fay Hou [Storage Service] ­ <
>> fayhou@coupang.com> wrote:
>>
>>> Most likely related to poor data modeling. The partition key data is
>>> on one node. Check the queries and table design.
>>>
>>> On Sep 2, 2017 5:48 PM, Andrew Bialecki <andrew.bialecki@klaviyo.com>
>>> wrote:
>>>
>>> We're running Cassandra 3.7 on AWS, different AZs, same region. The
>>> columns are counters and the workload is 95% writes, but of course those
>>> involve a local read and write because they're counters.
>>>
>>> We have a node with much higher CPU load than the others under heavy
>>> write volume. That node is at 100% disk utilization / high iowait. Looked
>>> at with iostat, the IO load is primarily reads (95% vs. writes), in terms
>>> of both requests and bytes. Below is a graph of the CPU.
>>>
>>> Any ideas to how we could diagnose what is causing so much IO vs. other
>>> nodes?
>>>
>>> Also, we're not sure why this node in particular is hot but not the other
>>> two "replica" nodes (we use RF = 3). We're using the DataStax driver and
>>> are looking into the load balancing policy to see if that's an issue.
>>>
>>> [image: Inline image 1]
>>>
>>> --
>>> Andrew Bialecki
>>> Klaviyo
>>>
>>>
>>>
>>
>>
>> --
>> Andrew Bialecki
>>
>> <https://www.klaviyo.com/>
>>
>
>
