Something else came to mind.

 

You’re on AWS.  You always have to keep noisy neighbor problems in the back of the mind when you aren’t running on bare metal.

 

Basically, either your usage pattern during these incidents is unchanged… or it is not unchanged.  If it is unchanged, and the problem happens, then something outside your usage is causing the problem.  Hence the noisy neighbor suspicion.  If the usage has changed during that time period, then all bets are off and you need to tune your cluster to be happy under more than one usage profile.

 

From: Jon Haddad <jon@jonhaddad.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, October 16, 2019 at 11:36 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Cc: Bill Walters <billwalters28@gmail.com>
Subject: Re: Elevated response times from all nodes in a data center at the same time.

 

Message from External Sender

It's possible the queries you're normally running are served out of page cache, and during the latency spike you're hitting your disks. If you're using read ahead you might be hitting a throughput limit on the disks.

 

I've got some numbers and graphs I can share later when I'm not on my phone. 

 

Jon

 

On Wed, Oct 16, 2019, 3:03 AM Alain RODRIGUEZ <arodrime@gmail.com> wrote:

Hello Bill,

 

I think it might be worth it to focus the effort on diagnosing the issue properly in the first place, thus I'll try to guide you through this.

 

First some comments on your environment:

 

AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zone in each region.

No of Nodes: 24

Data Centers: 4 (6 nodes in each data center, 2 OLTP Data centers for APIs and 2 OLAP Data centers for Analytics and Batch loads)

Instance Types: r5.8x Large

Average Node Size: 182 GB

Work Load: Read heavy

 

When I read this, I think 'Tune the garbage collection properly!'. do you see GC pauses being a problem? The easiest way to interpret GC logs is probably to upload them there: https://gceasy.io. Ensure there that the 'GC throughput' is at least around 95+% (ideally 98+%). This would mean that 2 to 5% of the time each node stops to perform stop the world GCs. If that's a thing, we can help you setting GC options a bit nicer than what it is currently probably. That post would then probably be a good starting point: https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

 

Read TPS: 22k

Cassandra version: 3.0.15

 

Reading this, I'd recommend an upgrade to the 3.0.latest (3.0.18 at the time being)  or (personal preference) 3.11.4. There was a bunch of fixes, maybe are you hitting something that was fixed already, check changes there, see if any change could do some good for your use case: https://github.com/apache/cassandra/blob/cassandra-3.0.18/CHANGES.txt#L1-L124

 

Java Version: JDK 181.

EBS Volumes: GP2 with 1TB 3000 iops.

 

The GP2 IOPS depends on the dis size. If you find out at anytime that disks are not keeping up, a good way out could be to increase the disk size (despite the small dataset) to actually increase the disk. IOPS & throughput. Now this did not recently change, and it was working for you until now, thus you don't have to Increase the disk size now probably. Just be aware that GP2 with 1 TB are quite slow.

 

About the issue:

 

our p99 latency in our AWS us-east-1 region OLTP data center, suddenly starts rising from 2 ms to 200 ms. It starts with one node where we see the 99th percentile Read Request latency in Datastax Opscenter starts increasing. And it spreads immediately, to all other 6 nodes in the data center.

 

Well, this sounds quite bad. The first 2 things coming to my mind here are:

- Are you reading tombstones? (check logs for tombstones, trace a few queries)

- Are you reading a huge partition? (check the max partition size, compare it to the mean and ensure it is remaining below 1 GB (or even 100 MB ideally).

 

An inefficient read, for the reasons above or other reasons, would not necessarily impact nodes' resources but could definitely destroy performances for this query and the following ones due to the 'requests congestion'.

 

To try to make a sense of the current tombstones level you can look at:

- logs (grep tombstone)

- sstablemetadata gives you the % of droppable tombstones. This is an estimate and of the space that could be freed, it gives no information on whether tombstones are being read and can affect performances or not, yet it gives you an idea of the tombstones that can be generated in the workflow

- Trace queries: either trace a manual query from cqlsh with 'TRACING ON;' then sending queries similar to prod ones. Or directly using 'nodetool settraceprobability X', /!\ ensure X is really low to start with - like 0.001 or 0.0001 maybe, we probably don't need many queries to understand what happened and tracing might inflicts a big penalty to Cassandra servers in terms of performances (each of the traced queries will induce a bunch of queries to actually persist the trace In system_traces key space.

 

We do not see any Read request timeouts or Exception in the our API Splunk logs only p99 and average latency go up suddenly.

 

What's the value you use for timeouts? Also, any other exception/timeout, somewhere else than for reads?

What are the result of:

 

- nodetool tablestats (I think this would gather what you need to check --> nodetool tablestats | grep -e Keyspace -e Table: -e latency -e partition -e tombstones)

- watch -d nodetool tpstats (here look at any pending threads constantly higher than 0, any blocked or dropped threads)

 

We have investigated CPU level usage, Disk I/O, Memory usage and Network parameters for the nodes during this period and we are not experiencing any sudden surge in these parameters. 

 

If the resources are fine, there is a bottleneck within Cassandra, that we need to find, the commands above aim at that, finding C*'s bottleneck, assuming machines can handle more load.

 

We setup client using WhiteListPolicy to send queries to each of the 6 nodes to understand which one is bad, but we see all of them responding with very high latency. It doesn't happen during our peak traffic period sometime in the night.

 

 This brings something else to my mind. The fact latency goes lower when there is a traffic increase, it can perfectly mean that each of the queries sent during the spike are really efficient, and despite you might still have some other queries being slow (even during peak hours), having that many 'efficient/quick' requests, lowers the average/pXX latencies. Does the max latency change over time?

 

You can here try to get a sense of this with:

 

- nodetool proxyhistograms

- nodetool tablehistograms <keyspace> <table> # For the table level stats

 

We checked the system.log files on our nodes, took a thread dump and checked for any rouge processes running on the nodes which is stealing CPU but we are able to find nothing.

 

From what I read/understand, resource are fine, I would put these searches aside for now. About the log file, I like to use:

 

- tail -fn 100 /var/log/cassandra/system.log #See current logs (if you are having the issues NOW)

- grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log # to check what happened and was wrong

 

For now I can't think about anything else, I hope some of those ideas will help you diagnose the problem. Once it is diagnosed, we should be able to reason about how we can fix it.

 

C*heers,

-----------------------

Alain Rodriguez - alain@thelastpickle.com

France / Spain

 

The Last Pickle - Apache Cassandra Consulting

 

Le mar. 15 oct. 2019 à 17:26, Reid Pinchback <rpinchback@tripadvisor.com> a écrit :

I’d look to see if you have compactions fronting the p99’s.  If so, then go back to looking at the I/O.  Disbelieve any metrics not captured at a high resolution for a time window around the compactions, like 100ms.  You could be hitting I/O stalls where reads are blocked by the flushing of writes.  It’s short-lived when it happens, and per-minute metrics won’t provide breadcrumbs.

 

From: Bill Walters <billwalters28@gmail.com>
Date: Monday, October 14, 2019 at 7:10 PM
To: <user@cassandra.apache.org>
Subject: Elevated response times from all nodes in a data center at the same time.

 

Hi Everyone,

 

Need some suggestions regarding a peculiar issue we started facing in our production cluster for the last couple of days. 

 

Here are our Production environment details.

 

AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zone in each region.

No of Nodes: 24

Data Centers: 4 (6 nodes in each data center, 2 OLTP Data centers for APIs and 2 OLAP Data centers for Analytics and Batch loads)

Instance Types: r5.8x Large

Average Node Size: 182 GB

Work Load: Read heavy

Read TPS: 22k

Cassandra version: 3.0.15

Java Version: JDK 181.

EBS Volumes: GP2 with 1TB 3000 iops.

 

1. We have been running in production for more than one year and our experience with Cassandra is great. Experienced little hiccups here and there but nothing severe.

 

2. But recently for the past couple of days we see a behavior where our p99 latency in our AWS us-east-1 region OLTP data center, suddenly starts rising from 2 ms to 200 ms. It starts with one node where we see the 99th percentile Read Request latency in Datastax Opscenter starts increasing. And it spreads immediately, to all other 6 nodes in the data center.

 

3. We do not see any Read request timeouts or Exception in the our API Splunk logs only p99 and average latency go up suddenly.

 

4. We have investigated CPU level usage, Disk I/O, Memory usage and Network parameters for the nodes during this period and we are not experiencing any sudden surge in these parameters. 

 

5. We setup client using WhiteListPolicy to send queries to each of the 6 nodes to understand which one is bad, but we see all of them responding with very high latency. It doesn't happen during our peak traffic period sometime in the night.

 

6. We checked the system.log files on our nodes, took a thread dump and checked for any rouge processes running on the nodes which is stealing CPU but we are able to find nothing.

 

7. We even checked our the write requests coming in during this time and we do not see any large batch operations happening. 

 

8. Initially we tried restarting the nodes to see if the issue can be mitigated but it kept happening, and we had to fail over API traffic to us-west-2 region OLTP data center. After a couple of hours we failed back and everything seems to be working.

 

We are baffled by this behavior, only correlation we find is the "Native requests pending" in our Task queues when this happens. 

 

Please let us know your suggestions on how to debug this issue. Has anyone experienced an issue like this before.(We had issues where one node starts acting bad due to bad EBS volume I/O read and write time, but all nodes experiencing an issue at same time is very peculiar)

 

Thank You,

Bill Walters.