ambari-user mailing list archives

From Ganesh Viswanathan <>
Subject Re: Ambari Metrics Collector Process alert - CRITICAL threshold rule
Date Fri, 28 Oct 2016 20:30:50 GMT
Thanks Sid, appreciate the info about hbase and hdfs alerts. I'll work on
upgrading Ambari but it will probably take time. One other question about
the alert.

*What does the value in seconds in the 'Metrics Collector Process' alert
mean? *The Ambari definition says:* "This alert is triggered if the Metrics
Collector cannot be confirmed to be up and listening on the configured port
for number of seconds equal to threshold."*

Is it the number of seconds the process was not active and reachable when
the check happened? But if it's a point-in-time check and the check is done
every 1 minute, why does this have a default of 1.5s and 5s for WARNING and


On Fri, Oct 28, 2016 at 1:25 PM, Siddharth Wagle <>

> Hi Ganesh,
> AMS in Ambari version 2.2.1 had some perf impact due to the HBase normalizer;
> I would recommend upgrading to 2.4.1 if possible.
> Regarding 2] The HBase and HDFS alerts are not all based on AMS; only
> the NameNode alerts ending with "_hourly" or "_daily" depend on AMS.
> Other alerts are either port/pid or JMX based.
> - Sid
> ------------------------------
> *From:* Ganesh Viswanathan <>
> *Sent:* Friday, October 28, 2016 1:07 PM
> *To:* Jonathan Hurley
> *Cc:*
> *Subject:* Re: Ambari Metrics Collector Process alert - CRITICAL
> threshold rule
> Thanks Jonathan, that explains some of the behavior I'm seeing.
> Two additional questions:
> 1)  How do I make sure the Ambari "Metrics Collector Process" does not
> alert immediately when the process is down? I am using Ambari and
> it has a bug [1] which can trigger restarts of the process. The fix for
> AMBARI-15492 <> has been
> documented in that wiki as "comment out auto-recovery". But that would mean
> the process would not restart (when the bug hits) bringing down visibility
> into the cluster metrics. We have disabled the auto-restart count alert
> because of the bug, but what is a good way to say "if the metrics collector
> process has been down for 15 mins, then alert"?
> 2) Will restarting "Metrics Collector Process" impact the other HBase or
> HDFS health alerts? Or is this process only for the Ambari Metrics system
> (collecting usage and internal Ambari metrics)? I am trying to see if the
> Ambari Metrics Collector Process can be disabled while still keeping the other
> HBase and HDFS alerts.
> [1]
> -Ganesh
> On Fri, Oct 28, 2016 at 12:36 PM, Jonathan Hurley <
> > wrote:
>> It sounds like you're asking two different questions here. Let me see if
>> I can address them:
>> Most "CRITICAL" thresholds do contain different text than their
>> OK/WARNING counterparts. This is because there is different information
>> which needs to be conveyed when an alert has gone CRITICAL. In the case of
>> this alert, it's a port connection problem. When that happens,
>> administrators are mostly interested in the error message and the attempted
>> host:port combination. I'm not sure what you mean by "CRITICAL is a point
>> in time alert". All alerts of the PORT/WEB variety are point-in-time
>> alerts. They represent the connection state of a socket and the data
>> returned over that socket at a specific point in time. The alert which gets
>> recorded in Ambari's database maintains the time of the alert. This value
>> is available via a tooltip hover in the UI.
>> The second part of your question is asking why increasing the timeout
>> value to something large, like 600, doesn't prevent the alert from
>> triggering. I believe this comes from how the Python sockets are being
>> used: a failed connection is not subject to the same timeout as a
>> socket which won't respond. If the machine is available and refuses the
>> connection outright, then the timeout wouldn't take effect.
>> On Oct 28, 2016, at 1:37 PM, Ganesh Viswanathan <> wrote:
>> Hello,
>> The Ambari "Metrics Collector Process" Alert has a different definition
>> for CRITICAL threshold vs. OK and WARNING thresholds. What is the reason
>> for this?
>> In my tests, CRITICAL seems like a "point-in-time" alert and the value of
>> that field is not being used. When the metrics collector process is killed
>> or restarts, the alert fires in 1min or less even when I set the threshold
>> value to 600s. This means the alert description of "*This alert is
>> triggered if the Metrics Collector cannot be confirmed to be up and
>> listening on the configured port for number of seconds equal to threshold."*
>> is NOT VALID for the CRITICAL threshold. Is that true and what is the reason for
>> this discrepancy? Has anyone else gotten false pages because of this and
>> what is the fix?
>> "ok": {
>>   "text": "TCP OK - {0:.3f}s response on port {1}"
>> },
>> "warning": {
>>   "text": "TCP OK - {0:.3f}s response on port {1}",
>>   "value": 1.5
>> },
>> "critical": {
>>   "text": "Connection failed: {0} to {1}:{2}",
>>   "value": 5.0
>> }
>> Ref:
>> 979bdaa49440457566/ambari-server/src/main/resources/
>> common-services/AMBARI_METRICS/0.1.0/alerts.json#L102
>> Thanks,
>> Ganesh
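
Jonathan's point about refused connections versus unresponsive sockets can be illustrated with a small Python sketch. This is not Ambari's agent code; `check_port` and its parameters are hypothetical names, and the sketch assumes the threshold values (1.5 and 5.0 from the quoted alerts.json) are compared against the connect latency, which is one reading consistent with the "TCP OK - {0:.3f}s response" text. The timeout only bounds how long the check waits on a silent host, so a refused connection comes back CRITICAL right away no matter how large the value is.

```python
import socket
import time


def check_port(host, port, warning=1.5, critical=5.0):
    """Hypothetical point-in-time TCP check returning (state, text)."""
    start = time.monotonic()
    try:
        # The timeout only limits how long we wait for a host that
        # does not answer at all.
        with socket.create_connection((host, port), timeout=critical):
            elapsed = time.monotonic() - start
    except OSError as exc:
        # A refused connection raises immediately, long before any
        # timeout expires; a timeout raises after `critical` seconds.
        text = "Connection failed: {0} to {1}:{2}".format(exc, host, port)
        return "CRITICAL", text
    # The connection succeeded, so only the latency decides OK vs. WARNING.
    state = "WARNING" if elapsed >= warning else "OK"
    return state, "TCP OK - {0:.3f}s response on port {1}".format(elapsed, port)
```

Under these assumptions, calling `check_port(host, port, critical=600)` against a host that is up but has nothing listening on the port would still report CRITICAL within a fraction of a second, which matches the behavior described at the start of the thread.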

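The "down for 15 mins, then alert" behavior asked about in the thread can be sketched as a generic wrapper around any point-in-time check. The class below is a hypothetical illustration of that idea, not an Ambari feature or setting: it remembers when the failures started and only reports CRITICAL once they have continued for the whole grace period. The `PENDING` state name is also an assumption made for this sketch.

```python
class DownForAlert:
    """Reports CRITICAL only after failures persist for `grace` seconds."""

    def __init__(self, grace=15 * 60):
        self.grace = grace
        self.down_since = None  # time of the first failure in the current run

    def observe(self, is_up, now):
        """Record one check result taken at time `now` (in seconds)."""
        if is_up:
            # Any successful check resets the failure window.
            self.down_since = None
            return "OK"
        if self.down_since is None:
            self.down_since = now
        if now - self.down_since >= self.grace:
            return "CRITICAL"
        return "PENDING"
```

With a check every minute, a process that restarts within the grace period never leaves `PENDING`, so a brief restart would not page anyone.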