From: Matthew Stump <mstump@vorstella.com>
Date: Tue, 5 Mar 2019 11:47:03 -0800
Subject: Re: Looking for feedback on automated root-cause system
To: user@cassandra.apache.org

We probably will; that'll come soon-ish (a couple of weeks, perhaps). Right
now we're limited by who we can engage with in order to collect feedback.

On Tue, Mar 5, 2019 at 11:34 AM Kenneth Brotman wrote:

> Simulators will never get you there. Why don't you let everyone plug in
> to the NOC in exchange for standard features or limited scale, and make
> some money on the big cats that you can make the value proposition
> attractive for anyway? You get the data you have to have, and for free;
> everyone's Cassandra cluster gets smart!
>
> From: Matthew Stump [mailto:mstump@vorstella.com]
> Sent: Tuesday, March 05, 2019 11:12 AM
> To: user@cassandra.apache.org
> Subject: Re: Looking for feedback on automated root-cause system
>
> Getting people to send data to us can be a little bit of a PITA, but it's
> doable. We've got data from regulated/secure environments streaming in.
> None of the data we collect is a risk, but the default is to say no, and
> you've got to overcome that barrier. We've been through the audit a bunch
> of times; it gets easier each time because everyone asks more or less the
> same questions and requires the same set of disclosures.
>
> Cold start for AI is always an issue, but we overcame it via two routes:
>
> We had customers from a pre-existing line of business. We were probably
> the first ones to run production Cassandra workloads at scale in k8s. We
> funded the work behind some of the initial blog posts and had to figure
> out most of the ins and outs of making it work.
> This data is good for helping to identify edge cases and bugs that you
> wouldn't normally encounter, but it's super noisy, and you've got to do a
> lot to isolate and/or derive value from the data in the beginning if
> you're attempting to do root cause.
>
> Leveraging the above, we built out an extensive simulations pipeline. It
> initially started as Python scripts targeting k8s, but it's since been
> fully automated with Spinnaker. We have a couple of simulations running
> all the time doing continuous integration with the models, collectors,
> and pipeline code, but will burst out to a couple hundred clusters if we
> need to test something complicated. It takes just a couple of minutes to
> have it spin up hundreds of different load generators, targeting
> different versions of C*, running with different topologies, using clean
> disks or restoring from previous snapshots.
>
> As the corpus grows, simulations matter less, and it's easier to get
> signal from noise in a customer cluster.
>
> On Tue, Mar 5, 2019 at 10:15 AM Kenneth Brotman wrote:
>
> Matt,
>
> Do you anticipate having trouble getting clients to allow the collector
> to send data up to your NOC? Wouldn't a lot of companies be unable or
> uneasy about that?
>
> Your ML can only work if it's got LOTS of data from many different
> scenarios. How are you addressing that? How are you able to get that
> much good quality data?
> Kenneth Brotman
>
> From: Kenneth Brotman [mailto:kenbrotman@yahoo.com]
> Sent: Tuesday, March 05, 2019 10:01 AM
> To: 'user@cassandra.apache.org'
> Subject: RE: Looking for feedback on automated root-cause system
>
> I see they have a website now at https://vorstella.com/
>
> From: Matt Stump [mailto:mrevilgnome@gmail.com]
> Sent: Friday, February 22, 2019 7:56 AM
> To: user
> Subject: Re: Looking for feedback on automated root-cause system
>
> For some reason responses to the thread didn't hit my work email; I
> didn't see them until I checked from my personal account.
>
> The way the system works is that we install a collector that pulls a
> bunch of metrics from each node and sends them up to our NOC every
> minute. We've got a bunch of stream processors that take this data and do
> a bunch of things with it. We've got some dumb ones that check for common
> misconfigurations, bugs, etc.; they also populate dashboards and a couple
> of minimal graphs. The more intelligent agents look at the metrics and
> start generating a bunch of calculated/scaled metrics and events. If one
> of these triggers a threshold, we kick off the ML that classifies the
> root cause using the stored data and points you to the correct knowledge
> base article with remediation steps. Because we've got the cluster
> history, we can identify a breach and give you an SLA in about 1 minute.
> The goal is to get you from 0 to resolution as quickly as possible.
>
> We're looking for feedback on the existing system: do these events make
> sense, do I need to beef up a knowledge base article, did it classify
> correctly, or is there some big bug that everyone is running into that
> needs to be publicized? We're also looking for where to go next: which
> models are going to make your life easier?
>
> The system works for C*, Elastic, and Kafka.
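The trigger-then-classify loop described above (per-minute node metrics, derived/scaled metrics, thresholds that kick off classification against stored cluster history) can be sketched roughly as follows. This is an illustrative sketch only, not Vorstella's actual code; the metric names, threshold values, and the rule-based `classify_root_cause` stand-in for the ML model are all assumptions.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional

@dataclass
class NodeMetrics:
    """One per-minute sample from a node's collector (fields assumed)."""
    node: str
    pending_compactions: int
    read_latency_p99_ms: float

# Rolling per-node history, standing in for the stored cluster history
# the classifier draws on once a threshold fires.
HISTORY: Dict[str, Deque[NodeMetrics]] = {}

# Hypothetical derived-metric thresholds.
THRESHOLDS = {
    "pending_compactions": 100,
    "read_latency_p99_ms": 50.0,
}

def ingest(sample: NodeMetrics) -> Optional[str]:
    """Store the sample; if any threshold trips, run classification."""
    history = HISTORY.setdefault(sample.node, deque(maxlen=1440))  # ~1 day at 1/min
    history.append(sample)
    if (sample.pending_compactions > THRESHOLDS["pending_compactions"]
            or sample.read_latency_p99_ms > THRESHOLDS["read_latency_p99_ms"]):
        return classify_root_cause(history)
    return None  # healthy sample: no event raised

def classify_root_cause(history: Deque[NodeMetrics]) -> str:
    """Toy stand-in for the ML classifier: map stored history to a known
    root cause, returned as a knowledge-base article reference."""
    latest = history[-1]
    if latest.pending_compactions > THRESHOLDS["pending_compactions"]:
        return "kb/compaction-falling-behind"
    return "kb/read-latency-sla-breach"
```

In this shape the cheap threshold check runs on every sample, while the expensive classification step only runs on breach, which is consistent with the "identify a breach and point you at remediation in about a minute" behavior described.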
> We'll be doing some blog posts explaining in more detail how it works
> and some of the interesting things we've found. For example, everything
> everyone thought they knew about Cassandra thread pool tuning is wrong,
> nobody really knows how to tune Kafka for large messages, and there are
> major issues with the Kubernetes charts that people are using.
>
> On Tue, Feb 19, 2019 at 4:40 PM Kenneth Brotman wrote:
>
> Any information you can share on the inputs it needs/uses would be
> helpful.
>
> Kenneth Brotman
>
> From: daemeon reiydelle [mailto:daemeonr@gmail.com]
> Sent: Tuesday, February 19, 2019 4:27 PM
> To: user
> Subject: Re: Looking for feedback on automated root-cause system
>
> Welcome to the world of testing predictive analytics. I will pass this
> on to my folks at Accenture; I know of a couple of C* clients we run.
> Wondering what you had in mind?
>
> Daemeon C.M. Reiydelle
> email: daemeonr@gmail.com
> San Francisco 1.415.501.0198 / London 44 020 8144 9872 / Skype
> daemeon.c.mreiydelle
>
> On Tue, Feb 19, 2019 at 3:35 PM Matthew Stump wrote:
>
> Howdy,
>
> I've been engaged in the Cassandra user community for a long time,
> almost 8 years, and have worked on hundreds of Cassandra deployments. One
> of the things I've noticed in myself and a lot of my peers who have done
> consulting, support, or worked on really big deployments is that we get
> burnt out. We fight a lot of the same fires over and over again, and
> don't get to work on new or interesting stuff. Also, what we do is really
> hard to transfer to other people because it's based on experience.
>
> Over the past year my team and I have been working to overcome that gap,
> creating an assistant that's able to scale some of this knowledge.
> We've got it to the point where it's able to classify known root causes
> for an outage or an SLA breach in Cassandra with an accuracy greater than
> 90%. It can accurately diagnose bugs, data-modeling issues, or misuse of
> certain features, and when it does, it gives you specific remediation
> steps with links to knowledge base articles.
>
> We think we've seeded our database with enough root causes that it'll
> catch the vast majority of issues, but there is always the possibility
> that we'll run into something previously unknown like CASSANDRA-11170
> (one of the issues our system found in the wild).
>
> We're looking for feedback and would like to know if anyone is
> interested in giving the product a trial. The process would be a
> collaboration, where we both get to learn from each other and improve how
> we're doing things.
>
> Thanks,
> Matt Stump