From: Matthew Stump <mstump@vorstella.com>
Date: Tue, 5 Mar 2019 11:47:03 -0800
Subject: Re: Looking for feedback on automated root-cause system
To: user@cassandra.apache.org

We probably will; that'll come soon-ish (a couple of weeks, perhaps). Right
now we're limited by who we can engage with in order to collect feedback.

On Tue, Mar 5, 2019 at 11:34 AM Kenneth Brotman wrote:

> Simulators will never get you there. Why don't you let everyone plug in
> to the NOC in exchange for standard features or limited scale, and make
> some money on the big cats that you can make the value proposition
> attractive for anyway? You get the data you have to have, and for free;
> everyone's Cassandra cluster gets smart!
>
> From: Matthew Stump [mailto:mstump@vorstella.com]
> Sent: Tuesday, March 05, 2019 11:12 AM
> To: user@cassandra.apache.org
> Subject: Re: Looking for feedback on automated root-cause system
>
> Getting people to send data to us can be a little bit of a PITA, but it's
> doable. We've got data from regulated/secure environments streaming in.
> None of the data we collect is a risk, but the default is to say no, and
> you've got to overcome that barrier. We've been through the audit a bunch
> of times; it gets easier each time because everyone asks more or less the
> same questions and requires the same set of disclosures.
>
> Cold start for AI is always an issue, but we overcame it via two routes:
>
> We had customers from a pre-existing line of business. We were probably
> the first ones to run production Cassandra workloads at scale in k8s. We
> funded the work behind some of the initial blog posts and had to figure
> out most of the ins and outs of making it work.
> This data is good for helping to identify edge cases and bugs that you
> wouldn't normally encounter, but it's super noisy, and you've got to do a
> lot to isolate and/or derive value from the data in the beginning if
> you're attempting to do root cause.
>
> Leveraging the above, we built out an extensive simulations pipeline. It
> initially started as Python scripts targeting k8s, but it's since been
> fully automated with Spinnaker. We have a couple of simulations running
> all the time doing continuous integration with the models, collectors,
> and pipeline code, but will burst out to a couple hundred clusters if we
> need to test something complicated. It takes just a couple of minutes to
> have it spin up hundreds of different load generators, targeting
> different versions of C*, running with different topologies, using clean
> disks or restoring from previous snapshots.
>
> As the corpus grows, simulations matter less, and it's easier to get
> signal from noise in a customer cluster.
>
> On Tue, Mar 5, 2019 at 10:15 AM Kenneth Brotman wrote:
>
> Matt,
>
> Do you anticipate having trouble getting clients to allow the collector
> to send data up to your NOC? Wouldn't a lot of companies be unable or
> uneasy about that?
>
> Your ML can only work if it's got LOTS of data from many different
> scenarios. How are you addressing that? How are you able to get that
> much good quality data?
> Kenneth Brotman
>
> From: Kenneth Brotman [mailto:kenbrotman@yahoo.com]
> Sent: Tuesday, March 05, 2019 10:01 AM
> To: 'user@cassandra.apache.org'
> Subject: RE: Looking for feedback on automated root-cause system
>
> I see they have a website now at https://vorstella.com/
>
> From: Matt Stump [mailto:mrevilgnome@gmail.com]
> Sent: Friday, February 22, 2019 7:56 AM
> To: user
> Subject: Re: Looking for feedback on automated root-cause system
>
> For some reason responses to the thread didn't hit my work email; I
> didn't see them until I checked from my personal account.
>
> The way the system works is that we install a collector that pulls a
> bunch of metrics from each node and sends them up to our NOC every
> minute. We've got a bunch of stream processors that take this data and do
> a bunch of things with it. We've got some dumb ones that check for common
> misconfigurations, bugs, etc.; they also populate dashboards and a couple
> of minimal graphs. The more intelligent agents look at the metrics and
> start generating a bunch of calculated/scaled metrics and events. If one
> of these triggers a threshold, we kick off the ML that classifies the
> root cause using the stored data and points you to the correct knowledge
> base article with remediation steps. Because we've got the cluster
> history, we can identify a breach and give you an SLA in about 1 minute.
> The goal is to get you from 0 to resolution as quickly as possible.
>
> We're looking for feedback on the existing system: do these events make
> sense, do I need to beef up a knowledge base article, did it classify
> correctly, or is there some big bug that everyone is running into that
> needs to be publicized? We're also looking for where to go next: which
> models are going to make your life easier?
>
> The system works for C*, Elastic, and Kafka.
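The trigger-then-classify loop described above (per-minute node metrics, derived/scaled metrics, thresholds that kick off classification against stored cluster history) can be sketched roughly as follows. This is an illustrative sketch only, not Vorstella's actual code; the metric names, threshold values, and the rule-based `classify_root_cause` stand-in for the ML model are all assumptions.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional

@dataclass
class NodeMetrics:
    """One per-minute sample from a node's collector (fields assumed)."""
    node: str
    pending_compactions: int
    read_latency_p99_ms: float

# Rolling per-node history, standing in for the stored cluster history
# the classifier draws on once a threshold fires.
HISTORY: Dict[str, Deque[NodeMetrics]] = {}

# Hypothetical derived-metric thresholds.
THRESHOLDS = {
    "pending_compactions": 100,
    "read_latency_p99_ms": 50.0,
}

def ingest(sample: NodeMetrics) -> Optional[str]:
    """Store the sample; if any threshold trips, run classification."""
    history = HISTORY.setdefault(sample.node, deque(maxlen=1440))  # ~1 day at 1/min
    history.append(sample)
    if (sample.pending_compactions > THRESHOLDS["pending_compactions"]
            or sample.read_latency_p99_ms > THRESHOLDS["read_latency_p99_ms"]):
        return classify_root_cause(history)
    return None  # healthy sample: no event raised

def classify_root_cause(history: Deque[NodeMetrics]) -> str:
    """Toy stand-in for the ML classifier: map stored history to a known
    root cause, returned as a knowledge-base article reference."""
    latest = history[-1]
    if latest.pending_compactions > THRESHOLDS["pending_compactions"]:
        return "kb/compaction-falling-behind"
    return "kb/read-latency-sla-breach"
```

In this shape the cheap threshold check runs on every sample, while the expensive classification step only runs on breach, which is consistent with the "identify a breach and point you at remediation in about a minute" behavior described.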
> We'll be doing some blog posts explaining in more detail how it works
> and some of the interesting things we've found. For example, everything
> everyone thought they knew about Cassandra thread pool tuning is wrong,
> nobody really knows how to tune Kafka for large messages, and there are
> major issues with the Kubernetes charts that people are using.
>
> On Tue, Feb 19, 2019 at 4:40 PM Kenneth Brotman wrote:
>
> Any information you can share on the inputs it needs/uses would be
> helpful.
>
> Kenneth Brotman
>
> From: daemeon reiydelle [mailto:daemeonr@gmail.com]
> Sent: Tuesday, February 19, 2019 4:27 PM
> To: user
> Subject: Re: Looking for feedback on automated root-cause system
>
> Welcome to the world of testing predictive analytics. I will pass this
> on to my folks at Accenture; I know of a couple of C* clients we run.
> Wondering what you had in mind?
>
> Daemeon C.M. Reiydelle
> email: daemeonr@gmail.com
> San Francisco 1.415.501.0198 / London 44 020 8144 9872 / Skype
> daemeon.c.mreiydelle
>
> On Tue, Feb 19, 2019 at 3:35 PM Matthew Stump wrote:
>
> Howdy,
>
> I've been engaged in the Cassandra user community for a long time,
> almost 8 years, and have worked on hundreds of Cassandra deployments. One
> of the things I've noticed in myself and a lot of my peers who have done
> consulting, support, or worked on really big deployments is that we get
> burnt out. We fight a lot of the same fires over and over again, and
> don't get to work on new or interesting stuff. Also, what we do is really
> hard to transfer to other people because it's based on experience.
>
> Over the past year my team and I have been working to overcome that gap,
> creating an assistant that's able to scale some of this knowledge.
> We've got it to the point where it's able to classify known root causes
> for an outage or an SLA breach in Cassandra with an accuracy greater than
> 90%. It can accurately diagnose bugs, data-modeling issues, or misuse of
> certain features, and when it does, it gives you specific remediation
> steps with links to knowledge base articles.
>
> We think we've seeded our database with enough root causes that it'll
> catch the vast majority of issues, but there is always the possibility
> that we'll run into something previously unknown like CASSANDRA-11170
> (one of the issues our system found in the wild).
>
> We're looking for feedback and would like to know if anyone is
> interested in giving the product a trial. The process would be a
> collaboration, where we both get to learn from each other and improve how
> we're doing things.
>
> Thanks,
> Matt Stump