cassandra-user mailing list archives

From Reid Pinchback <rpinchb...@tripadvisor.com>
Subject Re: Elevated response times from all nodes in a data center at the same time.
Date Wed, 16 Oct 2019 17:08:52 GMT
Something else came to mind.

You’re on AWS.  You always have to keep noisy neighbor problems in the back of your mind
when you aren’t running on bare metal.

Basically, either your usage pattern during these incidents is unchanged, or it isn’t.
If it is unchanged and the problem still happens, then something outside your usage is causing
the problem; hence the noisy neighbor suspicion. If the usage has changed during that time
period, then all bets are off and you need to tune your cluster to be happy under more than
one usage profile.
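A rough way to check for noisy neighbors from inside the guest (just a sketch; these are plain
Linux tools, nothing Cassandra-specific, and the 'steal' column is CPU time the hypervisor gave
to someone else):

  vmstat 1 30     # watch the 'st' column on the far right during an incident
  sar -u 1 30     # %steal column; needs the sysstat package installed

Sustained non-zero steal during an incident points outside your own workload.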

From: Jon Haddad <jon@jonhaddad.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, October 16, 2019 at 11:36 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Cc: Bill Walters <billwalters28@gmail.com>
Subject: Re: Elevated response times from all nodes in a data center at the same time.

It's possible the queries you're normally running are served out of page cache, and during
the latency spike you're hitting your disks. If you're using read ahead you might be hitting
a throughput limit on the disks.
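A quick way to sanity-check the read-ahead and disk throughput theory (a sketch; the device
name is an example, substitute your EBS data volume):

  blockdev --getra /dev/nvme1n1   # read-ahead in 512-byte sectors (256 = 128 KB)
  iostat -xm 5                    # watch rMB/s and %util on the data volume during a spike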

I've got some numbers and graphs I can share later when I'm not on my phone.

Jon

On Wed, Oct 16, 2019, 3:03 AM Alain RODRIGUEZ <arodrime@gmail.com> wrote:
Hello Bill,

I think it might be worth focusing the effort on diagnosing the issue properly in the first
place, so I'll try to guide you through this.

First some comments on your environment:

AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zone in each region.
No of Nodes: 24
Data Centers: 4 (6 nodes in each data center, 2 OLTP Data centers for APIs and 2 OLAP Data
centers for Analytics and Batch loads)
Instance Types: r5.8x Large
Average Node Size: 182 GB
Work Load: Read heavy

When I read this, I think 'Tune the garbage collection properly!'. Do you see GC pauses being
a problem? The easiest way to interpret GC logs is probably to upload them to https://gceasy.io.
Ensure there that the 'GC throughput' is at least around 95% (ideally 98%+); a throughput of
95 to 98% means each node spends 2 to 5% of its time in stop-the-world GC pauses. If GC pauses
turn out to be a problem, we can help you set the GC options a bit better than they probably
are currently. This post would then be a good starting point: https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
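If GC logging is already enabled (the GC logging options in jvm.options, commented out by
default in 3.0), here is a rough pause summary straight from the log, assuming the default log
location (a sketch):

  grep "Total time for which application threads were stopped" /var/log/cassandra/gc.log |
    awk '{ for (i = 1; i <= NF; i++) if ($i == "stopped:") { sum += $(i + 1); n++ } }
         END { printf "%d pauses, %.1f seconds stopped in total\n", n, sum }'

Those 'application threads were stopped' lines only appear if -XX:+PrintGCApplicationStoppedTime
is on; gceasy will give you much more detail either way.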

Read TPS: 22k
Cassandra version: 3.0.15

Reading this, I'd recommend an upgrade to the latest 3.0.x (3.0.18 at the time being) or (personal
preference) 3.11.4. There have been a bunch of fixes; maybe you are hitting something that was
fixed already. Check the changes there and see if any of them could do some good for your use
case: https://github.com/apache/cassandra/blob/cassandra-3.0.18/CHANGES.txt#L1-L124

Java Version: JDK 181.
EBS Volumes: GP2 with 1TB 3000 iops.

The GP2 IOPS depend on the disk size. If you find out at any point that the disks are not keeping
up, a good way out could be to increase the disk size (despite the small dataset) purely to
increase the disk's IOPS and throughput. Now, this did not recently change and it was working
for you until now, so you probably don't have to increase the disk size right away. Just be aware
that GP2 volumes at 1 TB are quite slow.
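For reference, the gp2 arithmetic (numbers from the AWS gp2 documentation at the time of
writing, worth double-checking):

  3 IOPS per GiB  ->  1,000 GiB x 3 = 3,000 IOPS (what you have now)
                      2,000 GiB x 3 = 6,000 IOPS (double the size, double the IOPS)

and per-volume throughput caps out around 250 MiB/s at these sizes, whatever the IOPS.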

About the issue:

our p99 latency in our AWS us-east-1 region OLTP data center suddenly starts rising from
2 ms to 200 ms. It starts with one node, where we see the 99th percentile Read Request latency
in DataStax OpsCenter start increasing, and it spreads immediately to all the other nodes
in the data center.

Well, this sounds quite bad. The first 2 things coming to my mind here are:
- Are you reading tombstones? (check the logs for tombstones, trace a few queries)
- Are you reading a huge partition? (check the max partition size, compare it to the mean,
and ensure it stays below 1 GB, or ideally even below 100 MB; see the commands sketched just below)
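A quick way to pull those partition numbers (a sketch; keyspace and table names are placeholders):

  nodetool tablestats <keyspace>.<table> | grep -i -e partition -e tombstone
  nodetool tablehistograms <keyspace> <table>   # percentile and max partition sizes / cell counts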

An inefficient read, for the reasons above or other reasons, would not necessarily impact
the nodes' resources but could definitely destroy performance for this query and the following
ones due to request congestion.

To try to make sense of the current tombstone levels you can look at (a quick sketch of the
last two follows below):
- the logs (grep for 'tombstone')
- sstablemetadata, which gives you the % of droppable tombstones. This is an estimate of the
space that could be freed; it gives no information on whether tombstones are actually being
read and hurting performance, but it does give you an idea of the tombstones being generated
by the workflow.
- traced queries: either trace a manual query from cqlsh with 'TRACING ON;' and then send queries
similar to the prod ones, or directly use 'nodetool settraceprobability X'. /!\ Ensure X is
really low to start with, like 0.001 or 0.0001 maybe; we probably don't need many traced queries
to understand what happened, and tracing can inflict a big penalty on the Cassandra servers
in terms of performance (each traced query induces a bunch of extra queries to actually
persist the trace in the system_traces keyspace).
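A minimal sketch of the sstablemetadata and tracing checks (paths, keyspace and table names
are placeholders):

  # droppable tombstone estimate, per SSTable:
  sstablemetadata /var/lib/cassandra/data/<keyspace>/<table>-*/*-big-Data.db | grep -i droppable

  # trace one representative query by hand:
  cqlsh> TRACING ON;
  cqlsh> SELECT ... FROM <keyspace>.<table> WHERE ...;

  # or sample a tiny fraction of live traffic, then switch it back off:
  nodetool settraceprobability 0.001
  cqlsh> SELECT * FROM system_traces.sessions LIMIT 20;
  nodetool settraceprobability 0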

We do not see any Read request timeouts or Exceptions in our API Splunk logs, only p99 and
average latency go up suddenly.

What value do you use for timeouts? Also, do you see any other exceptions/timeouts anywhere
other than for reads?
What are the results of:

- nodetool tablestats (I think this would gather what you need to check: nodetool tablestats
| grep -e Keyspace -e Table: -e latency -e partition -e tombstones)
- watch -d nodetool tpstats (here, look for any pending threads constantly higher than 0, and
any blocked or dropped threads)

We have investigated CPU level usage, Disk I/O, Memory usage and Network parameters for the
nodes during this period and we are not experiencing any sudden surge in these parameters.

If the resources are fine, then there is a bottleneck within Cassandra that we need to find.
The commands above aim at exactly that: finding C*'s bottleneck, assuming the machines can
handle more load.

We set up a client using WhiteListPolicy to send queries to each of the 6 nodes to understand
which one is bad, but we see all of them responding with very high latency. It doesn't happen
during our peak traffic period but sometime in the night.

This brings something else to my mind. The fact that latency goes lower when there is a traffic
increase can perfectly well mean that each of the queries sent during the spike is really
efficient, and even though you might still have some other queries being slow (even during peak
hours), having that many 'efficient/quick' requests lowers the average/pXX latencies. Does
the max latency change over time?

You can here try to get a sense of this with:

- nodetool proxyhistograms
- nodetool tablehistograms <keyspace> <table> # For the table level stats

We checked the system.log files on our nodes, took a thread dump and checked for any rogue
processes running on the nodes that might be stealing CPU, but we were able to find nothing.

From what I read/understand, resources are fine, so I would put these searches aside for now.
About the log file, I like to use:

- tail -fn 100 /var/log/cassandra/system.log # see the current logs (if you are having the issues
NOW)
- grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log # to check what happened and went
wrong

For now I can't think of anything else. I hope some of these ideas will help you diagnose
the problem; once it is diagnosed, we should be able to reason about how to fix it.

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

On Tue, Oct 15, 2019 at 17:26, Reid Pinchback <rpinchback@tripadvisor.com> wrote:
I’d look to see if you have compactions fronting the p99’s.  If so, then go back to looking
at the I/O.  Disbelieve any metrics not captured at a high resolution, like 100 ms, for a time
window around the compactions.  You could be hitting I/O stalls where reads are blocked by the
flushing of writes.  It’s short-lived when it happens, and per-minute metrics won’t provide
breadcrumbs.
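A couple of quick ways to see whether compactions line up with the spikes (a sketch):

  nodetool compactionstats -H          # what is compacting right now, and how much is pending
  nodetool compactionhistory | head    # recent compactions; compare timestamps with the p99 spikes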

From: Bill Walters <billwalters28@gmail.com>
Date: Monday, October 14, 2019 at 7:10 PM
To: <user@cassandra.apache.org>
Subject: Elevated response times from all nodes in a data center at the same time.

Hi Everyone,

Need some suggestions regarding a peculiar issue we started facing in our production cluster
for the last couple of days.

Here are our Production environment details.

AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zone in each region.
No of Nodes: 24
Data Centers: 4 (6 nodes in each data center, 2 OLTP Data centers for APIs and 2 OLAP Data
centers for Analytics and Batch loads)
Instance Types: r5.8x Large
Average Node Size: 182 GB
Work Load: Read heavy
Read TPS: 22k
Cassandra version: 3.0.15
Java Version: JDK 181.
EBS Volumes: GP2 with 1TB 3000 iops.

1. We have been running in production for more than one year and our experience with Cassandra
has been great. We have experienced little hiccups here and there, but nothing severe.

2. But recently, for the past couple of days, we see a behavior where our p99 latency in our
AWS us-east-1 region OLTP data center suddenly starts rising from 2 ms to 200 ms. It starts
with one node, where we see the 99th percentile Read Request latency in DataStax OpsCenter
start increasing, and it spreads immediately to all the other nodes in the data center.

3. We do not see any Read request timeouts or Exceptions in our API Splunk logs, only p99
and average latency go up suddenly.

4. We have investigated CPU level usage, Disk I/O, Memory usage and Network parameters for
the nodes during this period and we are not experiencing any sudden surge in these parameters.

5. We set up a client using WhiteListPolicy to send queries to each of the 6 nodes to understand
which one is bad, but we see all of them responding with very high latency. It doesn't happen
during our peak traffic period but sometime in the night.

6. We checked the system.log files on our nodes, took a thread dump and checked for any rogue
processes running on the nodes that might be stealing CPU, but we were able to find nothing.

7. We even checked the write requests coming in during this time and we do not see any
large batch operations happening.

8. Initially we tried restarting the nodes to see if the issue could be mitigated, but it kept
happening, and we had to fail over API traffic to the us-west-2 region OLTP data center. After
a couple of hours we failed back and everything seems to be working.

We are baffled by this behavior; the only correlation we find is the "Native requests pending"
in our task queues when this happens.

Please let us know your suggestions on how to debug this issue. Has anyone experienced an
issue like this before? (We had issues where one node starts acting badly due to bad EBS volume
I/O read and write times, but all nodes experiencing an issue at the same time is very peculiar.)

Thank You,
Bill Walters.