Date: Fri, 26 Aug 2016 12:21:53 +0000 (UTC)
From: Ryan Svihla
To: user@cassandra.apache.org
Subject: Re: Guidelines for configuring Thresholds for Cassandra metrics

Thomas,

Not all metrics are KPIs; most are only useful when researching a specific issue, or after a use-case-specific threshold has been set.

The main "canaries" I monitor are:

* Pending compactions (dependent on the compaction strategy chosen, but 1000 is a sign of severe issues in all cases)
* Dropped mutations (more than one I treat as an event to investigate; I believe in allowing operational overhead, and any evidence of load shedding suggests I may not have as much as I thought)
* Blocked anything (flush writers, etc.; more than one I investigate)
* System hints (more than 1k I investigate)
* Heap usage and GC time (these vary a lot by use case and collector chosen; I aim for below 65% usage as an average with G1, but this again varies a great deal by use case. Sometimes I just look at the chart and the query patterns, and if they don't line up I have to do other, deeper investigations)
* Read and write latencies exceeding the SLA (also use case dependent; for those that have no SLA, I tend to push towards a p99 of 100ms for a mid-range SSD-based system and 600ms for a spindle-based system, at CL ONE and assuming a "typical" query pattern; again, query patterns and CL vary here)
* Cell count and partition size (these vary greatly by hardware and GC tuning, but in the absence of all other relevant information I like to keep the cell count for a partition below 100k and the size below 100MB. I have many successful use cases running more, and I've had some fail well before that; hardware and tuning tradeoffs shift this around a lot)

There is, unfortunately, as you'll note, a lot of nuance, and the loadout really changes what looks right (down to the model of SSD: I have different expectations for p99s, and if it's a model I haven't used before I'll do some comparative testing).

The reason so much of this is general and vague is my selection bias: I'm brought in when people are complaining about performance, or about some grand systemic crash because they were monitoring nothing. I have little ability to change hardware initially, so I have to be willing to let the hardware do the best it can and establish the levels at which it can no longer keep up with the customer's goals. This may mean that for one use case 10 pending compactions is an actionable event, while for another customer 100 is. The better approach is to establish a baseline for when these metrics start to indicate a serious issue in that particular app. Basically: when people notice a problem, what did these numbers look like in the minutes, hours, and days prior? That's the way to establish the levels consistently.

Regards,

Ryan Svihla

On Fri, Aug 26, 2016 at 4:48 AM -0500, "Thomas Julian" <thomasjulian@zoho.com> wrote:

Hello,

I am working on setting up a monitoring tool for our Cassandra instances. Are there any wikis which specify an optimum value for each Cassandra KPI?

For instance, I am not sure:

1. What value of "Memtable Columns Count" can be considered "Normal"?
2. What value of the same has to be considered "Critical"?

I know the threshold numbers for a few params; for instance, anything more than zero for timeouts or pending tasks should be considered unusual. I am also aware that most statistics' thresholds vary with the hardware specification and the Cassandra environment setup. But what I am asking for here is a general guideline for configuring thresholds for all the metrics.

If this has already been covered, please point me to that resource. If anyone has collected these numbers out of their own interest, please share.

Any help is appreciated.

Best Regards,
Julian.
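The canary thresholds Ryan describes translate directly into a simple alerting check. A minimal sketch in Python, with the caveat that the metric names and the input dict are placeholder assumptions; map them to whatever your monitoring stack actually exposes (JMX, `nodetool tpstats`, `nodetool compactionstats`, etc.):

```python
# Sketch of an alerting check for the canary thresholds discussed in the thread.
# The metric names below are hypothetical; adapt them to your collector's output.

CANARY_THRESHOLDS = {
    "pending_compactions": 1000,   # severe at 1000 regardless of strategy; tune lower per app
    "dropped_mutations": 1,        # any load shedding is worth a look
    "blocked_flush_writers": 1,    # "blocked anything" -> investigate
    "system_hints": 1000,          # more than 1k -> investigate
}

def check_canaries(metrics, heap_usage_avg=None):
    """Return human-readable alerts for any metric at or over its threshold."""
    alerts = [
        f"{name}={metrics.get(name, 0)} (threshold {threshold})"
        for name, threshold in CANARY_THRESHOLDS.items()
        if metrics.get(name, 0) >= threshold
    ]
    # Heap: aim for below ~65% average usage with G1, varying by use case.
    if heap_usage_avg is not None and heap_usage_avg > 0.65:
        alerts.append(f"heap_usage_avg={heap_usage_avg:.0%} (target < 65%)")
    return alerts

print(check_canaries({"pending_compactions": 1200, "dropped_mutations": 0},
                     heap_usage_avg=0.71))
```

The fixed numbers here are only the generic starting points from the thread; per Ryan's advice, a given app's real thresholds should come from its own baseline.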
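Ryan's baselining advice (look at what the numbers did in the minutes, hours, and days before people noticed a problem) can be sketched as a lookback over recorded samples. The window sizes and the synthetic data below are illustrative assumptions:

```python
from datetime import datetime, timedelta
from statistics import mean

def baseline_before(series, incident_time,
                    windows=(timedelta(minutes=10), timedelta(hours=1), timedelta(days=1))):
    """Average a metric over several lookback windows ending at an incident.

    `series` is a list of (datetime, value) samples; returns {window: mean or None}.
    Comparing the windows shows how fast the metric was climbing before the event.
    """
    report = {}
    for window in windows:
        values = [v for t, v in series if incident_time - window <= t < incident_time]
        report[window] = round(mean(values), 1) if values else None
    return report

# Synthetic example: pending compactions ramp up toward an incident at 12:00.
t0 = datetime(2016, 8, 26, 12, 0)
samples = [(t0 - timedelta(minutes=m), 5 * (60 - m)) for m in range(1, 60)]
baseline = baseline_before(samples, t0)
# A 10-minute average far above the 1-hour average indicates a recent, sharp climb.
```

Run against real history, the per-window averages ahead of past incidents give exactly the app-specific levels Ryan suggests, rather than a one-size-fits-all number.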