ambari-user mailing list archives

From Siddharth Wagle <swa...@hortonworks.com>
Subject Re: Ambari Metrics Collector Process alert - CRITICAL threshold rule
Date Fri, 28 Oct 2016 20:25:37 GMT
Hi Ganesh,


AMS in Ambari version 2.2.1 had some performance impact due to the HBase normalizer; I would
recommend upgrading to 2.4.1 if possible.


Regarding 2] The HBase and HDFS alerts are not all based on AMS; only the NameNode alerts
ending with "_hourly" or "_daily" depend on AMS. The other alerts are either port/PID or JMX based.
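
If you want to double-check which definitions are metrics-backed in your cluster, one option
is to list them with their source types through the Ambari REST API. A rough sketch (the host,
credentials, and cluster name below are placeholders to adjust for your environment):

import base64
import json
import urllib.request

# Placeholders -- substitute your Ambari host, credentials, and cluster name.
AMBARI_URL = "http://ambari-host:8080"
CLUSTER = "mycluster"
AUTH = base64.b64encode(b"admin:admin").decode()

# Ask for each alert definition's name and source so its type is visible
# (AMS/METRIC based vs. PORT, WEB, SCRIPT, ...).
url = (AMBARI_URL + "/api/v1/clusters/" + CLUSTER +
       "/alert_definitions?fields=AlertDefinition/name,AlertDefinition/source")
req = urllib.request.Request(url, headers={"Authorization": "Basic " + AUTH})

with urllib.request.urlopen(req) as resp:
    data = json.loads(resp.read().decode("utf-8"))

for item in data.get("items", []):
    definition = item["AlertDefinition"]
    print(definition["name"], definition.get("source", {}).get("type"))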


- Sid


________________________________
From: Ganesh Viswanathan <gansvv@gmail.com>
Sent: Friday, October 28, 2016 1:07 PM
To: Jonathan Hurley
Cc: user@ambari.apache.org
Subject: Re: Ambari Metrics Collector Process alert - CRITICAL threshold rule

Thanks Jonathan, that explains some of the behavior I'm seeing.

Two additional questions:
1) How do I make sure the Ambari "Metrics Collector Process" alert does not fire immediately when
the process is down? I am using Ambari 2.2.1.0, which has a bug [1] that can trigger restarts
of the process. The fix for AMBARI-15492<http://issues.apache.org/jira/browse/AMBARI-15492>
is documented in that wiki as "comment out auto-recovery", but that would mean the process
would not restart when the bug hits, taking away visibility into the cluster metrics. We
have disabled the auto-restart count alert because of the bug, but what is a good way to say
"if the Metrics Collector process has been down for 15 minutes, then alert"?

2) Will restarting the "Metrics Collector Process" impact the other HBase or HDFS health alerts?
Or is this process only for the Ambari Metrics system (collecting usage and internal Ambari
metrics)? I am trying to see whether the Ambari Metrics Collector Process can be disabled while
still keeping the other HBase and HDFS alerts.

[1] https://cwiki.apache.org/confluence/display/AMBARI/Known+Issues


-Ganesh


On Fri, Oct 28, 2016 at 12:36 PM, Jonathan Hurley <jhurley@hortonworks.com<mailto:jhurley@hortonworks.com>>
wrote:
It sounds like you're asking two different questions here. Let me see if I can address them:

Most "CRITICAL" thresholds do contain different text then their OK/WARNING counterparts. This
is because there is different information which needs to be conveyed when an alert has gone
CRITICAL. In the case of this alert, it's a port connection problem. When that happens, administrators
are mostly interested in the error message and the attempted host:port combination. I'm not
sure what you mean by "CRITICAL is a point-in-time alert". All alerts of the PORT/WEB variety
are point-in-time alerts. They represent the connection state of a socket and the data returned
over that socket at a specific point in time. The alert which gets recorded in Ambari's database
maintains the time of the alert. This value is available via a tooltip hover in the UI.

The second part of your question is asking why increasing the timeout value to something large,
like 600, doesn't prevent the alert from triggering. I believe this comes down to how the Python
sockets are used: a connection that fails outright is not subject to the same timeout restrictions
as a socket that simply won't respond. If the machine is available and refuses the connection
outright, then the timeout never comes into play.
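
You can see that behavior with a tiny standalone test (just an illustration of the socket
semantics, not Ambari's actual alert code). It assumes nothing is listening on the port used
here, so the OS refuses the connection right away instead of letting it hang until the timeout:

import socket
import time

# Illustration only (not Ambari's alert script): a refused connection fails
# immediately, no matter how large the socket timeout is set.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(600)  # a generous timeout, like a large alert threshold

start = time.time()
try:
    # Assumes nothing is listening on this port, so the connection is
    # refused outright rather than hanging until the timeout expires.
    s.connect(("127.0.0.1", 6188))
except socket.error as e:
    print("failed after %.3fs: %s" % (time.time() - start, e))
finally:
    s.close()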



On Oct 28, 2016, at 1:37 PM, Ganesh Viswanathan <gansvv@gmail.com<mailto:gansvv@gmail.com>>
wrote:

Hello,

The Ambari "Metrics Collector Process" Alert has a different defintion for CRITICAL threshold
vs. OK and WARNING thresholds. What is the reason for this?

In my tests, CRITICAL seems to be a "point-in-time" alert and the value of that field is not
being used. When the Metrics Collector process is killed or restarts, the alert fires in 1 minute
or less even when I set the threshold value to 600s. This means the alert description, "This
alert is triggered if the Metrics Collector cannot be confirmed to be up and listening on
the configured port for number of seconds equal to threshold.", is NOT VALID for the CRITICAL
threshold. Is that true, and what is the reason for this discrepancy? Has anyone else gotten
false pages because of this, and what is the fix?

"ok": {
"text": "TCP OK - {0:.3f}s response on port {1}"
},
"warning": {
"text": "TCP OK - {0:.3f}s response on port {1}",
"value": 1.5
},
"critical": {
"text": "Connection failed: {0} to {1}:{2}",
"value": 5.0
}

Ref:
https://github.com/apache/ambari/blob/2ad42074f1633c5c6f56cf979bdaa49440457566/ambari-server/src/main/resources/common-services/AMBARI_METRICS/0.1.0/alerts.json#L102
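
For context, my reading of those thresholds (a simplified sketch of my understanding, not the
actual Ambari alert script) is that the warning value gates the response time of a successful
connection, the socket timeout is tied to the critical value, and a refused connection is
reported as CRITICAL immediately:

import socket
import time

WARNING_SECS = 1.5   # "warning" value from alerts.json
CRITICAL_SECS = 5.0  # "critical" value from alerts.json

def check_port(host, port):
    start = time.time()
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(CRITICAL_SECS)
    try:
        s.connect((host, port))
        elapsed = time.time() - start
        state = "WARNING" if elapsed >= WARNING_SECS else "OK"
        return (state, "TCP OK - %.3fs response on port %s" % (elapsed, port))
    except socket.error as e:
        # A refused connection (or a timeout) lands here, so there is no
        # response time to report -- hence the different CRITICAL text and
        # the immediate alert, no matter how large the threshold is.
        return ("CRITICAL", "Connection failed: %s to %s:%s" % (e, host, port))
    finally:
        s.close()

print(check_port("127.0.0.1", 6188))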

Thanks,
Ganesh


