cassandra-user mailing list archives

From Chris Lohfink <clohfin...@gmail.com>
Subject Re: determining the cause of a high CPU / disk util node
Date Mon, 04 Sep 2017 16:33:13 GMT
nodetool cfstats will help identify any table that has super wide
partitions, too many tombstones on reads, or an excessive number of
sstables. Any of these can cause this, but large partitions and too many
tombstones *usually* show up as excessive GC and the node flapping
down/up, or as high user cpu rather than high iowait cpu.
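
As a rough sketch (flag names and output fields vary a bit by version,
and the keyspace/table name below is just a placeholder), you can eyeball
those stats per table with something like:

    nodetool cfstats my_keyspace.my_table    # "nodetool tablestats" on newer versions
    # fields worth checking in the output:
    #   Compacted partition maximum bytes                  -> super wide partitions
    #   Maximum tombstones per slice (last five minutes)   -> tombstone-heavy reads
    #   SSTable count                                      -> compaction falling behind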

Are you using EBS or ephemerals? Have you checked iostat or anything with
the disks to make sure they are not going bad? If disk latency spikes or a
disk starts failing you can end up seeing a node with high iowait. If it's
not a disk failure, you can debug and find which files are causing most of
the IO (e.g.
http://bencane.com/2012/08/06/troubleshooting-high-io-wait-in-linux/ )
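
Roughly, the usual drill is something like the following (a sketch only;
iotop may need installing, and <cassandra_pid> is a placeholder):

    iostat -x 1 5            # per-device utilization, await, reads vs writes
    iotop -oPa               # which processes are actually doing the IO
    lsof -p <cassandra_pid> | grep -i '\.db'   # which sstable files are open

If the reads are concentrated on a handful of sstables for one table, that
usually points back at the data model rather than the disk.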

Might be easiest just to replace that host with a new node and see if that
fixes it, which would rule out hardware. It might also fix things if that
one node is way behind on compactions, effectively resetting it to be in
line with the other nodes.
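
If you go that route, the usual approach (sketch only, double-check the
docs for your version) is to bootstrap the new node with the
replace_address flag so it takes over the old node's tokens:

    # on the new node, before first start (e.g. in cassandra-env.sh):
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<ip_of_node_being_replaced>"
    # start cassandra and let it finish streaming before doing anything else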

Chris

On Sun, Sep 3, 2017 at 10:15 AM, Andrew Bialecki <
andrew.bialecki@klaviyo.com> wrote:

> Fay, what do you mean by "partition key data is on one node"? Shouldn't a
> write request with RF=3 be fulfillable by any of the three nodes?
>
> I do think we have a "hot key," we're working on tracking that down.
>
> On Sat, Sep 2, 2017 at 11:30 PM, Fay Hou [Storage Service] <
> fayhou@coupang.com> wrote:
>
>> Most likely related to poor data modeling. The partition key data is on
>> one node. Check the queries and the table design.
>>
>> On Sep 2, 2017 5:48 PM, Andrew Bialecki <andrew.bialecki@klaviyo.com>
>> wrote:
>>
>> We're running Cassandra 3.7 on AWS, different AZs, same region. The
>> columns are counters and the workload is 95% writes, but of course those
>> involve a local read and write because they're counters.
>>
>> We have a node with much higher CPU load than the others under heavy write
>> volume. That node is at 100% disk utilization / high iowait. Looking at
>> iostat, the IO load is primarily reads (95%) vs. writes, in terms of both
>> requests and bytes. Below is a graph of the CPU.
>>
>> Any ideas as to how we could diagnose what is causing so much IO vs. the
>> other nodes?
>>
>> Also, we're not sure why this node in particular is hot and not the other
>> two "replica" nodes (we use RF = 3). We're using the DataStax driver and
>> are looking into the load balancing policy to see if that's an issue.
>>
>> [image: Inline image 1]
>>
>> --
>> Andrew Bialecki
>> Klaviyo
>>
>>
>>
>
>
> --
> Andrew Bialecki
>
> <https://www.klaviyo.com/>
>
