From: Siddharth Wagle
To: Ganesh Viswanathan, user@ambari.apache.org
Subject: Re: Ambari Metrics Collector Process alert - CRITICAL threshold rule
Date: Fri, 28 Oct 2016 20:48:56 +0000

It means that the alert runner/thread on the agent waits for 1.5 seconds before raising a WARNING and 5 seconds before raising a CRITICAL message.

The configured interval for how often this runs is independent of these thresholds consumed by the alert instance.

I am not sure what you mean by "point-in-time"; the alert_history table does log execution results, although we bubble up the last execution status. @Jonathan Hurley might be able to shed more light on the finer details.

BR,
Sid
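For illustration, here is a minimal, standalone sketch of the check logic described above. It is not Ambari's actual alert runner; the function name check_port, the use of socket.create_connection, and the collector port 6188 are assumptions, while the threshold values and message formats mirror the alerts.json snippet quoted later in this thread.

    import socket
    import time

    # Thresholds as in the AMBARI_METRICS alerts.json snippet below (illustrative use).
    WARNING_SECONDS = 1.5   # "warning" -> "value": 1.5
    CRITICAL_SECONDS = 5.0  # "critical" -> "value": 5.0

    def check_port(host, port):
        """One point-in-time TCP check: returns (state, message)."""
        start = time.time()
        try:
            # The critical value doubles as the socket timeout for the attempt.
            with socket.create_connection((host, port), timeout=CRITICAL_SECONDS):
                elapsed = time.time() - start
        except OSError as exc:
            # A refused or unreachable connection fails immediately, regardless of the timeout.
            return "CRITICAL", "Connection failed: {0} to {1}:{2}".format(exc, host, port)
        if elapsed > WARNING_SECONDS:
            return "WARNING", "TCP OK - {0:.3f}s response on port {1}".format(elapsed, port)
        return "OK", "TCP OK - {0:.3f}s response on port {1}".format(elapsed, port)

    if __name__ == "__main__":
        print(check_port("localhost", 6188))  # 6188 is commonly the Metrics Collector port

Each run is a single sample; the agent's schedule, not these thresholds, determines how often that sample is taken.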
________________________________
From: Jonathan Hurley <jhurley@hortonworks.com>
Sent: Friday, October 28, 2016 1:44 PM
To: Ganesh Viswanathan
Cc: user@ambari.apache.org
Subject: Re: Ambari Metrics Collector Process alert - CRITICAL threshold rule

In your version of Ambari, the alert will trigger right away. In Ambari 2.4, we have the notion of "soft" and "hard" alerts. You can configure it so that it doesn't trigger alert notifications until n CRITICAL alerts have been received in a row.
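The soft/hard behavior can be pictured with a small counter; the class and field names below are illustrative only, not Ambari's, and the repeat count of 3 is just an example.

    class SoftHardAlert:
        """Turn point-in-time CRITICAL results into a notification only after a streak."""

        def __init__(self, repeat_tolerance=3):
            self.repeat_tolerance = repeat_tolerance
            self.consecutive_criticals = 0

        def observe(self, state):
            """Feed one check result; return True when a notification should fire."""
            if state != "CRITICAL":
                self.consecutive_criticals = 0  # any OK/WARNING result resets the streak
                return False
            self.consecutive_criticals += 1     # a "soft" CRITICAL until the streak is long enough
            return self.consecutive_criticals >= self.repeat_tolerance

    if __name__ == "__main__":
        alert = SoftHardAlert(repeat_tolerance=3)
        for result in ["CRITICAL", "CRITICAL", "OK", "CRITICAL", "CRITICAL", "CRITICAL"]:
            print(result, "-> notify" if alert.observe(result) else "-> hold")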
On Oct 28, 2016, at 4:07 PM, Ganesh Viswanathan <gansvv@gmail.com> wrote:

Thanks Jonathan, that explains some of the behavior I'm seeing.

Two additional questions:

1) How do I make sure the Ambari "Metrics Collector Process" alert does not fire immediately when the process is down? I am using Ambari 2.2.1.0, and it has a bug [1] which can trigger restarts of the process. The fix for AMBARI-15492 has been documented in that wiki as "comment out auto-recovery", but that would mean the process would not restart when the bug hits, taking away visibility into the cluster metrics. We have disabled the auto-restart count alert because of the bug, but what is a good way to say "if the metrics collector process has been down for 15 minutes, then alert"?

2) Will restarting the "Metrics Collector Process" impact the other HBase or HDFS health alerts? Or is this process only for the Ambari Metrics system (collecting usage and internal Ambari metrics)? I am trying to see if the Ambari Metrics Collector Process can be disabled while still keeping the other HBase and HDFS alerts.

[1] https://cwiki.apache.org/confluence/display/AMBARI/Known+Issues

-Ganesh

On Fri, Oct 28, 2016 at 12:36 PM, Jonathan Hurley <jhurley@hortonworks.com> wrote:

It sounds like you're asking two different questions here. Let me see if I can address them:

Most "CRITICAL" thresholds do contain different text than their OK/WARNING counterparts, because different information needs to be conveyed when an alert has gone CRITICAL. In the case of this alert, it's a port connection problem, and when that happens administrators are mostly interested in the error message and the attempted host:port combination. I'm not sure what you mean by "CRITICAL is a point in time alert". All alerts of the PORT/WEB variety are point-in-time alerts: they represent the connection state of a socket and the data returned over that socket at a specific point in time. The alert that gets recorded in Ambari's database maintains the time of the alert; this value is available via a tooltip hover in the UI.

The second part of your question asks why increasing the timeout value to something large, like 600, doesn't prevent the alert from triggering. I believe this comes down to how the Python sockets are used: a failed connection is not subject to the same timeout restrictions as a socket that won't respond. If the machine is available and refuses the connection outright, the timeout never takes effect (a short standalone snippet at the end of this message illustrates this).

On Oct 28, 2016, at 1:37 PM, Ganesh Viswanathan <gansvv@gmail.com> wrote:

Hello,

The Ambari "Metrics Collector Process" alert has a different definition for the CRITICAL threshold vs. the OK and WARNING thresholds. What is the reason for this?

In my tests, CRITICAL seems to be a "point-in-time" alert and the value of that field is not being used. When the metrics collector process is killed or restarts, the alert fires in one minute or less even when I set the threshold value to 600s. This means the alert description, "This alert is triggered if the Metrics Collector cannot be confirmed to be up and listening on the configured port for number of seconds equal to threshold," is NOT VALID for the CRITICAL threshold. Is that true, and what is the reason for this discrepancy? Has anyone else gotten false pages because of this, and what is the fix?

    "ok": {
      "text": "TCP OK - {0:.3f}s response on port {1}"
    },
    "warning": {
      "text": "TCP OK - {0:.3f}s response on port {1}",
      "value": 1.5
    },
    "critical": {
      "text": "Connection failed: {0} to {1}:{2}",
      "value": 5.0
    }

Ref: https://github.com/apache/ambari/blob/2ad42074f1633c5c6f56cf979bdaa49440457566/ambari-server/src/main/resources/common-services/AMBARI_METRICS/0.1.0/alerts.json#L102

Thanks,
Ganesh
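To see the refusal-vs-timeout behavior Jonathan describes, a standalone snippet (not Ambari code) is enough; it assumes nothing is currently listening on the chosen port, and the port number 6188 is only an example.

    import socket
    import time

    start = time.time()
    try:
        # A large timeout does not help here: connecting to a closed port is refused
        # immediately, so the 600-second budget is never used.
        conn = socket.create_connection(("localhost", 6188), timeout=600)
    except OSError as exc:
        # Typically prints a small fraction of a second, not 600 seconds.
        print("failed after {0:.3f}s: {1}".format(time.time() - start, exc))
    else:
        conn.close()
        print("connected in {0:.3f}s".format(time.time() - start))

Only a host that silently drops packets (for example, one behind a firewall or unreachable) would make the connection attempt wait out the full timeout.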

