cassandra-user mailing list archives

From Matthew Stump <mst...@vorstella.com>
Subject Experimental results: everything we thought we knew about thread-pools is wrong
Date Wed, 13 Mar 2019 20:04:49 GMT
Howdy,

As a follow-up to my earlier thread, I wanted to share some of our
experimental data from using an auto-tuning agent to find optimal thread
pool settings for C*. The results surprised me. They invalidate a lot of
previous thinking about tuning C* and call into question some of the
decisions that were made in the 3.x timeline.

https://vorstella.com/blog/autotuning-cassandra-to-reduce-latencies/

I want to flag that some of this research is more than a year old, and the
blog post focuses on the 2.2 branch. We also tested with 3.x, and found
that the results do transfer, but Avinash didn't have those graphs handy
when we went to write it up.

We were inspired by the OtterTune blog post. Using that and a couple of
other papers, I hacked together an agent that can spin up multiple Cassandra
clusters in Kubernetes (k8s), run a load-generating container against each
cluster, observe the results, and deploy a new configuration recommended by
the ML model.
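
For readers who want a concrete picture of that propose/run/observe loop,
here is a minimal sketch. It is not the agent's actual code. The benchmark
is a synthetic placeholder with a made-up latency shape, plain random search
stands in for the ML recommender, and the knob ranges are illustrative
assumptions. Only the two knob names (concurrent_reads, concurrent_writes)
correspond to real cassandra.yaml settings.

```python
import random

# Hypothetical knob ranges. The names are real cassandra.yaml settings, but
# the ranges are illustrative and not taken from the experiments.
KNOB_RANGES = {
    "concurrent_reads": (8, 256),
    "concurrent_writes": (8, 256),
}


def propose_config(rng):
    """Stand-in recommender: sample each knob uniformly from its range."""
    return {name: rng.randint(lo, hi) for name, (lo, hi) in KNOB_RANGES.items()}


def run_benchmark(config, read_fraction):
    """Placeholder for deploying a cluster and running a load generator.

    Returns a synthetic latency with an invented non-linear shape whose
    optimum shifts with the read/write mix. It is not a model of Cassandra.
    """
    ideal_reads = 32 + 96 * read_fraction
    ideal_writes = 32 + 96 * (1 - read_fraction)
    read_penalty = ((config["concurrent_reads"] - ideal_reads) / 64) ** 2
    write_penalty = ((config["concurrent_writes"] - ideal_writes) / 64) ** 2
    return 1.0 + read_penalty + write_penalty


def tune(read_fraction, trials=50, seed=0):
    """Propose a config, benchmark it, and keep the lowest-latency result."""
    rng = random.Random(seed)
    best_config, best_latency = None, float("inf")
    for _ in range(trials):
        config = propose_config(rng)
        latency = run_benchmark(config, read_fraction)
        if latency < best_latency:
            best_config, best_latency = config, latency
    return best_config, best_latency


if __name__ == "__main__":
    for mix in (0.2, 0.8):
        print(mix, tune(mix))
```

In a real setup, run_benchmark would be replaced by the deploy and load-test
step, and propose_config by whatever model recommends the next configuration.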

The hope was that we could find a smooth, multi-dimensional performance
surface that would let us observe a workload and make a settings
recommendation. We tested around 20 tunable knobs, including heap settings,
and applied dimensionality reduction techniques to arrive at a few dominant
knobs to use in demos.

What we found was that the relationships between many of the configurable
variables are non-linear, and that the optimal settings for thread pools are
highly dependent not only on the read/write mix but also on request sizes.
*We were able to demonstrate a 43% latency reduction and an 80% throughput
increase over documented best practices.*

Additionally, we found that there is no single good value for MCT and that
the decision to embed a simplistic hard-coded model in 3.x was probably a
mistake.

We used the same model/methodology paired with a Gatling container that
mimicked a customer workload, and were able to demonstrate a 2x lift in
throughput and a 60% reduction in latency with a DataStax search customer.
These performance gains allowed them to deploy to EBS on AWS successfully.

We haven't moved this model to production yet; we chose to focus on a
couple of other items first. But if you're interested in engaging further or
have questions about our research, we'd be happy to talk.
