cassandra-user mailing list archives

From Reid Pinchback <rpinchb...@tripadvisor.com>
Subject Re: Elevated response times from all nodes in a data center at the same time.
Date Tue, 15 Oct 2019 15:19:54 GMT
I'd look to see if you have compactions fronting the p99s.  If so, then go back to looking
at the I/O.  Disbelieve any metrics not captured at high resolution (e.g. 100 ms) for a time
window around the compactions.  You could be hitting I/O stalls where reads are blocked by the
flushing of writes.  It's short-lived when it happens, and per-minute metrics won't provide
breadcrumbs.
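
For example, a throwaway sampler along these lines can surface those short stalls. This is a
rough sketch; the device name is an assumption, so point it at whatever EBS volume backs your
Cassandra data directory:

#!/usr/bin/env python3
# Sample /proc/diskstats at ~100 ms resolution to catch short-lived I/O stalls.
# Rough sketch: DEVICE is an assumption -- check `lsblk` on the node.
import time

DEVICE = "nvme1n1"   # assumed data volume
INTERVAL = 0.1       # 100 ms, per the suggestion above

def read_stats(device):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                # reads completed, ms spent reading, writes completed,
                # ms spent writing, I/Os currently in progress
                return (int(fields[3]), int(fields[6]),
                        int(fields[7]), int(fields[10]), int(fields[11]))
    raise RuntimeError("device %s not found in /proc/diskstats" % device)

prev = read_stats(DEVICE)
while True:
    time.sleep(INTERVAL)
    cur = read_stats(DEVICE)
    d_reads, d_rd_ms = cur[0] - prev[0], cur[1] - prev[1]
    d_writes, d_wr_ms = cur[2] - prev[2], cur[3] - prev[3]
    # A read-await spike while writes are flushing is the stall signature.
    print("%s reads=%d r_await_ms=%.1f writes=%d w_await_ms=%.1f inflight=%d" % (
        time.strftime("%H:%M:%S"), d_reads, d_rd_ms / max(d_reads, 1),
        d_writes, d_wr_ms / max(d_writes, 1), cur[4]))
    prev = cur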

From: Bill Walters <billwalters28@gmail.com>
Date: Monday, October 14, 2019 at 7:10 PM
To: <user@cassandra.apache.org>
Subject: Elevated response times from all nodes in a data center at the same time.

Hi Everyone,

We need some suggestions regarding a peculiar issue we have been facing in our production
cluster for the last couple of days.

Here are our Production environment details.

AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zones in each region.
No of Nodes: 24
Data Centers: 4 (6 nodes in each data center, 2 OLTP Data centers for APIs and 2 OLAP Data
centers for Analytics and Batch loads)
Instance Types: r5.8xlarge
Average Node Size: 182 GB
Work Load: Read heavy
Read TPS: 22k
Cassandra version: 3.0.15
Java Version: JDK 181.
EBS Volumes: GP2, 1 TB, 3000 IOPS.

1. We have been running in production for more than one year and our experience with Cassandra
is great. Experienced little hiccups here and there but nothing severe.

2. But for the past couple of days we have seen a behavior where the p99 latency in our
AWS us-east-1 region OLTP data center suddenly rises from 2 ms to 200 ms. It starts with
one node, where we see the 99th percentile read request latency in DataStax OpsCenter start
to climb, and it spreads immediately to all the other nodes in the data center.
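
For reference, the same coordinator-level p99 read latency can be watched per node outside
OpsCenter as well. Below is a minimal sketch that polls nodetool proxyhistograms once a
second; the column parsing assumes the 3.0-era output layout and may need adjusting:

#!/usr/bin/env python3
# Poll `nodetool proxyhistograms` and log the 99% read latency on this node.
# Hedged sketch: the exact column layout varies between Cassandra versions.
import subprocess
import time

def p99_read_latency_micros():
    out = subprocess.run(["nodetool", "proxyhistograms"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        parts = line.split()
        if parts and parts[0] == "99%":
            return float(parts[1])   # Read Latency column, in micros
    return None

while True:
    print("%s coordinator p99 read latency: %s us" % (
        time.strftime("%H:%M:%S"), p99_read_latency_micros()))
    time.sleep(1)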

3. We do not see any read request timeouts or exceptions in our API Splunk logs; only the p99
and average latencies rise suddenly.

4. We have investigated CPU usage, disk I/O, memory usage and network parameters for the
nodes during this period, and we do not see any sudden surge in any of them.

5. We set up a client using WhiteListPolicy to send queries to each of the 6 nodes to understand
which one is bad, but we see all of them responding with very high latency. It doesn't happen
during our peak traffic period, but sometime in the night.
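
For reference, a minimal sketch of such a pinned client, using the DataStax Python driver's
WhiteListRoundRobinPolicy (the equivalent of the Java driver's WhiteListPolicy); the node
address and the probe query are placeholders:

#!/usr/bin/env python3
# Pin a test client to a single coordinator to compare per-node latency.
import time
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import WhiteListRoundRobinPolicy

NODE = "10.0.0.11"   # hypothetical address of the node under test

profile = ExecutionProfile(
    load_balancing_policy=WhiteListRoundRobinPolicy([NODE]))
cluster = Cluster(contact_points=[NODE],
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()

# Fire a lightweight query repeatedly and record client-side latency.
for _ in range(100):
    start = time.perf_counter()
    session.execute("SELECT release_version FROM system.local")
    print("%s: %.2f ms" % (NODE, (time.perf_counter() - start) * 1000))
    time.sleep(0.1)

cluster.shutdown()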

6. We checked the system.log files on our nodes, took a thread dump and checked for any rogue
processes running on the nodes that might be stealing CPU, but we found nothing.

7. We even checked the write requests coming in during this time and we do not see any
large batch operations happening.
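
One way to cross-check this is to scan system.log for the warnings Cassandra emits when a
batch exceeds batch_size_warn_threshold_in_kb. A rough sketch, where the log path and the
exact warning wording are assumptions to adapt:

#!/usr/bin/env python3
# Scan system.log for batch-size warnings as a cross-check on point 7.
import re

LOG_PATH = "/var/log/cassandra/system.log"   # adjust to your install

pattern = re.compile(r"WARN.*batch.*exceeding", re.IGNORECASE)

with open(LOG_PATH, errors="replace") as log:
    for line in log:
        if pattern.search(line):
            print(line.rstrip())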

8. Initially we tried restarting the nodes to see if the issue could be mitigated, but it kept
happening, and we had to fail over API traffic to the us-west-2 region OLTP data center. After
a couple of hours we failed back and everything seems to be working.

We are baffled by this behavior; the only correlation we find is the "Native requests pending"
count in our task queues when this happens.
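
To put numbers on that correlation, the pending column for Native-Transport-Requests can be
sampled from nodetool tpstats; a minimal sketch (the field index is an assumption and may
differ slightly across versions):

#!/usr/bin/env python3
# Track the Native-Transport-Requests pending count over time.
import subprocess
import time

def native_transport_pending():
    out = subprocess.run(["nodetool", "tpstats"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.startswith("Native-Transport-Requests"):
            # Pool Name, Active, Pending, Completed, Blocked, All time blocked
            return int(line.split()[2])
    return None

while True:
    print("%s native requests pending: %s" % (
        time.strftime("%H:%M:%S"), native_transport_pending()))
    time.sleep(1)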

Please let us know your suggestions on how to debug this issue. Has anyone experienced an
issue like this before? (We have had issues where one node starts acting badly due to poor EBS
volume I/O read and write times, but all nodes experiencing an issue at the same time is very
peculiar.)

Thank You,
Bill Walters.