cassandra-user mailing list archives

From "Kenneth Brotman" <>
Subject RE: Looking for feedback on automated root-cause system
Date Tue, 05 Mar 2019 18:14:53 GMT


Do you anticipate having trouble getting clients to allow the collector to send data up to
your NOC?  Wouldn’t a lot of companies be unable or uneasy about that?


Your ML can only work if it’s got LOTS of data from many different scenarios.  How are you
addressing that?  How are you able to get that much good quality data?


Kenneth Brotman


From: Kenneth Brotman [] 
Sent: Tuesday, March 05, 2019 10:01 AM
To: ''
Subject: RE: Looking for feedback on automated root-cause system


I see they have a website now at



From: Matt Stump [] 
Sent: Friday, February 22, 2019 7:56 AM
To: user
Subject: Re: Looking for feedback on automated root-cause system


For some reason, responses to the thread didn't hit my work email; I didn't see them
until I checked from my personal account.


The way that the system works is that we install a collector that pulls a bunch of metrics
from each node and sends it up to our NOC every minute. We've got a bunch of stream processors
that take this data and do a bunch of things with it. We've got some dumb ones that check
for common misconfigurations, bugs, etc. They also populate dashboards and a couple of minimal
graphs. The more intelligent agents take a look at the metrics and they start generating a
bunch of calculated/scaled metrics and events. If one of these triggers a threshold then we
kick off the ML that does classification using the stored data to classify the root cause,
and point you to the correct knowledge base article with remediation steps. Because we've
got the cluster history we can identify a breach, and give you an SLA in about 1 minute. The
goal is to get you from 0 to resolution as quickly as possible. 
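
As a rough illustration of the flow described above (not the actual implementation), the
threshold-then-classify step might be sketched as follows. Everything here is an assumption
made for the example: the Sample type, the metric name, the threshold value, the KB article
IDs, and toy_classifier, which is a rule-based stand-in for the ML classifier.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Sample:
    """One metric reading sent up by a collector (hypothetical shape)."""
    node: str
    metric: str
    value: float


# (root cause label, knowledge base article id); both are made-up placeholders.
Diagnosis = Tuple[str, str]


def derive_events(samples: List[Sample], thresholds: Dict[str, float]) -> List[Sample]:
    """Keep only the samples whose value exceeds the threshold for their metric."""
    return [
        s for s in samples
        if s.metric in thresholds and s.value > thresholds[s.metric]
    ]


def process_minute(
    samples: List[Sample],
    thresholds: Dict[str, float],
    classify: Callable[[List[Sample]], Diagnosis],
) -> Optional[Diagnosis]:
    """Run one batch: classify only when at least one threshold is breached."""
    events = derive_events(samples, thresholds)
    if not events:
        return None
    return classify(events)


def toy_classifier(events: List[Sample]) -> Diagnosis:
    """Rule-based stand-in for the ML classifier, for illustration only."""
    metrics = {e.metric for e in events}
    if "pending_compactions" in metrics:
        return ("compaction_backlog", "KB-0001")
    return ("unknown", "KB-0000")


if __name__ == "__main__":
    batch = [
        Sample("node1", "pending_compactions", 250.0),
        Sample("node2", "pending_compactions", 3.0),
    ]
    print(process_minute(batch, {"pending_compactions": 100.0}, toy_classifier))
```

The classifier is passed in as a function so that the cheap threshold check runs on every
batch and the more expensive classification runs only when something is breached.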


We're looking for feedback on the existing system: do these events make sense, do I need to
beef up a knowledge base article, did it classify correctly, or is there some big bug that
everyone is running into that needs to be publicized? We're also looking for where to go next:
which models are going to make your life easier?


The system works for C*, Elastic and Kafka. We'll be doing some blog posts explaining in more
detail how it works and some of the interesting things we've found. For example, everything
everyone thought they knew about Cassandra thread pool tuning is wrong, nobody really knows
how to tune Kafka for large messages, and there are major issues with the Kubernetes charts
that people are using.




On Tue, Feb 19, 2019 at 4:40 PM Kenneth Brotman <> wrote:

Any information you can share on the inputs it needs/uses would be helpful.


Kenneth Brotman


From: daemeon reiydelle [] 
Sent: Tuesday, February 19, 2019 4:27 PM
To: user
Subject: Re: Looking for feedback on automated root-cause system


Welcome to the world of testing predictive analytics. I will pass this on to my folks at Accenture;
I know of a couple of C* clients we run. I'm wondering what you had in mind?



Daemeon C.M. Reiydelle


San Francisco 1.415.501.0198/London 44 020 8144 9872/Skype daemeon.c.mreiydelle



On Tue, Feb 19, 2019 at 3:35 PM Matthew Stump <> wrote:


I’ve been engaged in the Cassandra user community for a long time, almost 8 years, and have
worked on hundreds of Cassandra deployments. One of the things I’ve noticed in myself and
a lot of my peers that have done consulting, support or worked on really big deployments is
that we get burnt out. We fight a lot of the same fires over and over again, and don’t get
to work on new or interesting stuff. Also, what we do is really hard to transfer to other people
because it’s based on experience. 

Over the past year my team and I have been working to overcome that gap, creating an assistant
that’s able to scale some of this knowledge. We’ve got it to the point where it’s able
to classify known root causes for an outage or an SLA breach in Cassandra with an accuracy
greater than 90%. It can accurately diagnose bugs, data-modeling issues, or misuse of certain
features, and when it does, it gives you specific remediation steps with links to knowledge base


We think we’ve seeded our database with enough root causes that it’ll catch the vast majority
of issues, but there is always the possibility that we’ll run into something previously unknown
like CASSANDRA-11170 (one of the issues our system found in the wild).

We’re looking for feedback and would like to know if anyone is interested in giving the
product a trial. The process would be a collaboration, where we both get to learn from each
other and improve how we’re doing things.

Matt Stump
