From: Siddharth Wagle
To: Ganesh Viswanathan, user@ambari.apache.org
Subject: Re: Ambari Metrics Collector Process alert - CRITICAL threshold rule
Date: Fri, 28 Oct 2016 20:48:56 +0000

It means that the alert runner/thread on the agent waits for 1.5 seconds before raising a WARNING and 5 seconds before raising a CRITICAL message.

The configured interval for how often this runs is independent of these thresholds consumed by the alert instance.

I am not sure what you mean by "point-in-time"; the alert_history table does log execution results, although we bubble up the last execution status. @Jonathan Hurley might be able to shed more light on the finer details.

BR,
Sid
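For illustration, here is a minimal, standalone sketch of the check logic described above. It is not Ambari's actual alert runner; the function name check_port, the use of socket.create_connection, and the collector port 6188 are assumptions, while the threshold values and message formats mirror the alerts.json snippet quoted later in this thread.

    import socket
    import time

    # Thresholds as in the AMBARI_METRICS alerts.json snippet below (illustrative use).
    WARNING_SECONDS = 1.5   # "warning" -> "value": 1.5
    CRITICAL_SECONDS = 5.0  # "critical" -> "value": 5.0

    def check_port(host, port):
        """One point-in-time TCP check: returns (state, message)."""
        start = time.time()
        try:
            # The critical value doubles as the socket timeout for the attempt.
            with socket.create_connection((host, port), timeout=CRITICAL_SECONDS):
                elapsed = time.time() - start
        except OSError as exc:
            # A refused or unreachable connection fails immediately, regardless of the timeout.
            return "CRITICAL", "Connection failed: {0} to {1}:{2}".format(exc, host, port)
        if elapsed > WARNING_SECONDS:
            return "WARNING", "TCP OK - {0:.3f}s response on port {1}".format(elapsed, port)
        return "OK", "TCP OK - {0:.3f}s response on port {1}".format(elapsed, port)

    if __name__ == "__main__":
        print(check_port("localhost", 6188))  # 6188 is commonly the Metrics Collector port

Each run is a single sample; the agent's schedule, not these thresholds, determines how often that sample is taken.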
________________________________
From: Jonathan Hurley <jhurley@hortonworks.com>
Sent: Friday, October 28, 2016 1:44 PM
To: Ganesh Viswanathan
Cc: user@ambari.apache.org
Subject: Re: Ambari Metrics Collector Process alert - CRITICAL threshold rule

In your version of Ambari, the alert will trigger right away. In Ambari 2.4, we have the notion of "soft" and "hard" alerts. You can configure it so that it doesn't trigger alert notifications until n CRITICAL alerts have been received in a row.
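The soft/hard behavior can be pictured with a small counter; the class and field names below are illustrative only, not Ambari's, and the repeat count of 3 is just an example.

    class SoftHardAlert:
        """Turn point-in-time CRITICAL results into a notification only after a streak."""

        def __init__(self, repeat_tolerance=3):
            self.repeat_tolerance = repeat_tolerance
            self.consecutive_criticals = 0

        def observe(self, state):
            """Feed one check result; return True when a notification should fire."""
            if state != "CRITICAL":
                self.consecutive_criticals = 0  # any OK/WARNING result resets the streak
                return False
            self.consecutive_criticals += 1     # a "soft" CRITICAL until the streak is long enough
            return self.consecutive_criticals >= self.repeat_tolerance

    if __name__ == "__main__":
        alert = SoftHardAlert(repeat_tolerance=3)
        for result in ["CRITICAL", "CRITICAL", "OK", "CRITICAL", "CRITICAL", "CRITICAL"]:
            print(result, "-> notify" if alert.observe(result) else "-> hold")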
On Oct 28, 2016, at 4:07 PM, Ganesh Viswanathan <gansvv@gmail.com> wrote:

Thanks Jonathan, that explains some of the behavior I'm seeing.

Two additional questions:

1) How do I make sure the Ambari "Metrics Collector Process" alert does not fire immediately when the process is down? I am using Ambari 2.2.1.0, and it has a bug [1] which can trigger restarts of the process. The fix for AMBARI-15492 has been documented in that wiki as "comment out auto-recovery", but that would mean the process would not restart when the bug hits, taking away visibility into the cluster metrics. We have disabled the auto-restart count alert because of the bug, but what is a good way to say "if the metrics collector process has been down for 15 minutes, then alert"?

2) Will restarting the "Metrics Collector Process" impact the other HBase or HDFS health alerts? Or is this process only for the Ambari Metrics system (collecting usage and internal Ambari metrics)? I am trying to see if the Ambari Metrics Collector Process can be disabled while still keeping the other HBase and HDFS alerts.

[1] https://cwiki.apache.org/confluence/display/AMBARI/Known+Issues

-Ganesh

On Fri, Oct 28, 2016 at 12:36 PM, Jonathan Hurley <jhurley@hortonworks.com> wrote:

It sounds like you're asking two different questions here. Let me see if I can address them:

Most "CRITICAL" thresholds do contain different text than their OK/WARNING counterparts, because different information needs to be conveyed when an alert has gone CRITICAL. In the case of this alert, it's a port connection problem, and when that happens administrators are mostly interested in the error message and the attempted host:port combination. I'm not sure what you mean by "CRITICAL is a point in time alert". All alerts of the PORT/WEB variety are point-in-time alerts: they represent the connection state of a socket and the data returned over that socket at a specific point in time. The alert that gets recorded in Ambari's database maintains the time of the alert; this value is available via a tooltip hover in the UI.

The second part of your question asks why increasing the timeout value to something large, like 600, doesn't prevent the alert from triggering. I believe this comes down to how the Python sockets are used: a failed connection is not subject to the same timeout restrictions as a socket that won't respond. If the machine is available and refuses the connection outright, the timeout never takes effect (a short standalone snippet at the end of this message illustrates this).

On Oct 28, 2016, at 1:37 PM, Ganesh Viswanathan <gansvv@gmail.com> wrote:

Hello,

The Ambari "Metrics Collector Process" alert has a different definition for the CRITICAL threshold vs. the OK and WARNING thresholds. What is the reason for this?

In my tests, CRITICAL seems to be a "point-in-time" alert and the value of that field is not being used. When the metrics collector process is killed or restarts, the alert fires in one minute or less even when I set the threshold value to 600s. This means the alert description, "This alert is triggered if the Metrics Collector cannot be confirmed to be up and listening on the configured port for number of seconds equal to threshold," is NOT VALID for the CRITICAL threshold. Is that true, and what is the reason for this discrepancy? Has anyone else gotten false pages because of this, and what is the fix?

    "ok": {
      "text": "TCP OK - {0:.3f}s response on port {1}"
    },
    "warning": {
      "text": "TCP OK - {0:.3f}s response on port {1}",
      "value": 1.5
    },
    "critical": {
      "text": "Connection failed: {0} to {1}:{2}",
      "value": 5.0
    }

Ref: https://github.com/apache/ambari/blob/2ad42074f1633c5c6f56cf979bdaa49440457566/ambari-server/src/main/resources/common-services/AMBARI_METRICS/0.1.0/alerts.json#L102

Thanks,
Ganesh
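To see the refusal-vs-timeout behavior Jonathan describes, a standalone snippet (not Ambari code) is enough; it assumes nothing is currently listening on the chosen port, and the port number 6188 is only an example.

    import socket
    import time

    start = time.time()
    try:
        # A large timeout does not help here: connecting to a closed port is refused
        # immediately, so the 600-second budget is never used.
        conn = socket.create_connection(("localhost", 6188), timeout=600)
    except OSError as exc:
        # Typically prints a small fraction of a second, not 600 seconds.
        print("failed after {0:.3f}s: {1}".format(time.time() - start, exc))
    else:
        conn.close()
        print("connected in {0:.3f}s".format(time.time() - start))

Only a host that silently drops packets (for example, one behind a firewall or unreachable) would make the connection attempt wait out the full timeout.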

