cassandra-user mailing list archives

From Andrew Bialecki <andrew.biale...@klaviyo.com>
Subject Re: determining the cause of a high CPU / disk util node
Date Wed, 06 Sep 2017 04:27:28 GMT
Alain / Chris, really appreciate the responses. The lead we're currently
hunting down is that a node backed by EBS crossed the 160 MB/s throughput
threshold, so it seems we've maxed that out. We've been playing with
read_ahead_kb to no avail, so we're going to try RAIDing drives together to
increase throughput. Will report back.
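
For reference, roughly what we've been tweaking and watching (a rough sketch;
the device name below is a placeholder for whichever block device backs the
data directory):

    # current read-ahead for the data volume, in KB
    cat /sys/block/xvdf/queue/read_ahead_kb

    # try a smaller read-ahead, e.g. 64 KB, then re-measure
    echo 64 | sudo tee /sys/block/xvdf/queue/read_ahead_kb

    # per-device throughput and utilization, 1-second samples
    iostat -xm 1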

On Mon, Sep 4, 2017 at 12:57 PM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:

> Hi Andrew, I have seen a very similar problem in AWS where a node would
> regularly perform badly for no obvious reason (io-wait) plus a terrible load.
>
> The first thing would be to make sure this node does not have a specifically
> high load for some good reason: an imbalanced and genuinely higher load, a
> bad configuration, reading tombstones, big requests (maybe due to variable
> partition size)... This can be checked using monitoring with per-node
> aggregation stats or using nodetool info. Make sure this node is behaving in
> a way that is comparable to the other nodes. Due to dynamic snitching, this
> node could even be receiving fewer requests than the others.
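>
> Comparing a few basics side by side across nodes is usually enough to spot
> an outlier. Something along these lines (a rough sketch; the keyspace name
> is a placeholder):
>
>     # load, uptime, heap usage and cache hit rates for this node
>     nodetool info
>
>     # data load and ownership per node across the ring
>     nodetool status my_keyspace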
>
> Another possibility is that the hardware is somehow broken; from time to
> time, a server is simply unhealthy on AWS :-). Maybe the disk this instance
> relies on is failing.
>
> Then be aware of what are called "noisy neighbours". Basically, if using AWS
> you are probably running a VM on top of a bigger machine, possibly shared
> with other customers. The underlying disk is shared as well, even when using
> instance stores. So if your 'neighbours' are having a fun time processing
> data with Spark or something like that, your own usage can be affected. So
> even if the problem only occurs from time to time, it can be hardware
> overload (or a hardware deficiency, from your perspective).
>
> A short-term fix might be to trash this instance after bringing up a
> replacement node. AWS allows that, and so does Cassandra. If that's doable,
> it's probably a good thing to try. We prefer a down node to a slow node in
> Cassandra, so +1 on Chris's suggestion there.
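>
> If you go that route, the usual approach (a rough sketch; double-check the
> docs for your exact version) is to start the replacement instance with the
> dead node's address in the replace flag, e.g. in cassandra-env.sh:
>
>     JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<address of node being replaced>"
>
> and let it stream the old node's token ranges before terminating the old
> instance.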
>
> Also, this problem vanished (for good) in my case when switching to
> instances with local SSDs. I am not sure what you are using, but having more
> throughput headroom protects you somewhat from this issue, I think.
>
> As a last thought, you can have dedicated servers in AWS as well. I don't
> know much about this, but I believe that would also remove this risk.
>
> Cheers,
> -----------------------
> Alain Rodriguez - @arodream - alain@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2017-09-04 17:33 GMT+01:00 Chris Lohfink <clohfink85@gmail.com>:
>
>> nodetool cfstats will help identify any table that has super-wide
>> partitions, too many tombstones read per query, or an excessive number of
>> sstables. Any of these can cause this, but large partitions and too many
>> tombstones *usually* show up as excessive GC and the node flapping down/up,
>> or as high user CPU rather than high iowait CPU.
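>>
>> For example (a rough sketch; keyspace/table names are placeholders):
>>
>>     # max/mean partition size, tombstones scanned per read, sstable count
>>     nodetool cfstats my_keyspace.my_table
>>
>>     # distribution of partition sizes and sstables touched per read
>>     nodetool cfhistograms my_keyspace my_table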
>>
>> Are you using EBS or ephemeral storage? Have you checked iostat or anything
>> else on the disks to make sure they are not going bad? If disk latency
>> spikes or a disk goes bad you can end up seeing a node with high iowait. If
>> it's not a disk failure, you can debug and find which files are causing
>> most of the IO (e.g.
>> http://bencane.com/2012/08/06/troubleshooting-high-io-wait-in-linux/ ).
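>>
>> Roughly (a sketch):
>>
>>     # extended device stats; watch await and %util for the data volume
>>     iostat -x 1
>>
>>     # which processes/threads are actually doing the IO
>>     sudo iotop -o
>>
>>     # per-process disk read/write rates over time
>>     pidstat -d 1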
>>
>> It might be easiest just to replace that host with a new node and see if
>> that fixes it, to rule out hardware. This might also fix the problem if
>> that one node is way behind on compactions, since it resets the node to be
>> in line with the others.
>>
>> Chris
>>
>> On Sun, Sep 3, 2017 at 10:15 AM, Andrew Bialecki <
>> andrew.bialecki@klaviyo.com> wrote:
>>
>>> Fay, what do you mean by "partition key data is on one node"? Shouldn't
>>> a write request with RF=3 be fulfillable by any of the three replica
>>> nodes?
>>>
>>> I do think we have a "hot key"; we're working on tracking that down.
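>>>
>>> For reference, one way to see which replicas own a suspected hot key
>>> (a rough sketch; keyspace, table, and key below are placeholders):
>>>
>>>     nodetool getendpoints my_keyspace my_table suspect_partition_key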
>>>
>>> On Sat, Sep 2, 2017 at 11:30 PM, Fay Hou [Storage Service] ­ <
>>> fayhou@coupang.com> wrote:
>>>
>>>> Most likely this is related to poor data modeling. The data for a
>>>> partition key is on one node. Check into the queries and the table design.
>>>>
>>>> On Sep 2, 2017 5:48 PM, Andrew Bialecki <andrew.bialecki@klaviyo.com>
>>>> wrote:
>>>>
>>>> We're running Cassandra 3.7 on AWS, different AZs, same region. The
>>>> columns are counters and the workload is 95% writes, but of course each
>>>> of those writes involves a local read and a write because they're
>>>> counters.
>>>>
>>>> We have a node with much higher CPU load than the others under heavy
>>>> write volume. That node is at 100% disk utilization / high iowait. The IO
>>>> load, when looked at with iostat, is primarily reads (95%) vs. writes, in
>>>> terms of both requests and bytes. Below is a graph of the CPU.
>>>>
>>>> Any ideas on how we could diagnose what is causing so much more IO on
>>>> this node vs. the other nodes?
>>>>
>>>> Also, we're not sure why this node in particular is hotter than the
>>>> other two "replica" nodes (we use RF = 3). We're using the DataStax
>>>> driver and are looking into the load balancing policy to see if that's
>>>> an issue.
>>>>
>>>> [image: Inline image 1]
>>>>
>>>> --
>>>> Andrew Bialecki
>>>> Klaviyo
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Andrew Bialecki
>>>
>>> <https://www.klaviyo.com/>
>>>
>>
>>
>


-- 
Andrew Bialecki

<https://www.klaviyo.com/>
