ambari-dev mailing list archives

From "Dmitry Lysnichenko (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AMBARI-7791) HBase Master CPU utilization alert is not suppressed at MM
Date Thu, 16 Oct 2014 16:20:33 GMT

     [ https://issues.apache.org/jira/browse/AMBARI-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Lysnichenko updated AMBARI-7791:
---------------------------------------
    Attachment: AMBARI-7791_branch-1.7.0.patch.2
                AMBARI-7791.patch.2

> HBase Master CPU utilization alert is not suppressed at MM
> ----------------------------------------------------------
>
>                 Key: AMBARI-7791
>                 URL: https://issues.apache.org/jira/browse/AMBARI-7791
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 1.7.0
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>             Fix For: 1.7.0
>
>         Attachments: AMBARI-7791.patch, AMBARI-7791.patch.1, AMBARI-7791.patch.2, AMBARI-7791_branch-1.7.0.patch, AMBARI-7791_branch-1.7.0.patch.1, AMBARI-7791_branch-1.7.0.patch.2
>
>
> It looks like we have a design flaw that affects the suppression of some alerts. It causes
> a rare bug that probably also affects 1.6.1.
> h2. The short story
> When we put HBase Master (or the entire HBase service) into Maintenance Mode (MM) and then
> stop HBase Master, the alert "HBase Master CPU utilization" pops up and is not suppressed.
> The issue reproduces only when HBase Master is located on a different host than the Nagios server.
> h2. How suppressing alerts works 
> When we put a service, host, or host component into MM, the server builds a complete
> map of the host components that are in MM and posts it to an agent. The agent writes this
> info to the file /var/nagios/ignore.dat in the following form:
> {code}
> vm-3.vm GANGLIA GANGLIA_MONITOR
> vm-0.vm HBASE HBASE_MASTER
> vm-3.vm HDFS DATANODE
> vm-2.vm HBASE HBASE_REGIONSERVER
> vm-0.vm HBASE HBASE_REGIONSERVER
> vm-1.vm HBASE HBASE_REGIONSERVER
> vm-3.vm YARN NODEMANAGER
> vm-3.vm HBASE HBASE_REGIONSERVER
> {code}
> All alerts in Nagios are wrapped in a shell script (check_wrapper.sh). When any alert
> is generated, this wrapper checks whether the hostname, service name, and component name for
> this alert are present in /var/nagios/ignore.dat. If they are, the alert is suppressed.
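
The lookup described above can be sketched in Python. This is only an illustration of the matching rule, not the actual check_wrapper.sh logic: the function names `parse_ignore_entries` and `is_suppressed` are hypothetical, and the sketch assumes each line of ignore.dat holds exactly three whitespace-separated fields, as in the sample shown earlier.

```python
from typing import Iterable, Set, Tuple

# (hostname, service name, component name), one tuple per ignore.dat line.
Entry = Tuple[str, str, str]


def parse_ignore_entries(lines: Iterable[str]) -> Set[Entry]:
    """Parse ignore.dat-style lines into a set of (host, service, component)."""
    entries: Set[Entry] = set()
    for line in lines:
        parts = line.split()
        # Assumption: lines that do not have exactly three fields are skipped.
        if len(parts) == 3:
            entries.add((parts[0], parts[1], parts[2]))
    return entries


def is_suppressed(entries: Set[Entry], host: str, service: str, component: str) -> bool:
    """An alert is suppressed only if all three fields match one entry."""
    return (host, service, component) in entries


# Sample content copied from the ignore.dat excerpt in the issue description.
SAMPLE = """\
vm-3.vm GANGLIA GANGLIA_MONITOR
vm-0.vm HBASE HBASE_MASTER
vm-3.vm HDFS DATANODE
vm-2.vm HBASE HBASE_REGIONSERVER
vm-0.vm HBASE HBASE_REGIONSERVER
vm-1.vm HBASE HBASE_REGIONSERVER
vm-3.vm YARN NODEMANAGER
vm-3.vm HBASE HBASE_REGIONSERVER
"""

entries = parse_ignore_entries(SAMPLE.splitlines())
print(is_suppressed(entries, "vm-0.vm", "HBASE", "HBASE_MASTER"))
print(is_suppressed(entries, "vm-1.vm", "HDFS", "DATANODE"))
```

Because the match is on the full triple, an alert whose reported hostname differs from the host stored in the file is not suppressed, even when the service and component names match.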
> h2. What exactly is broken
> In https://issues.apache.org/jira/browse/AMBARI-6358 we had a requirement to have only
> one 'HBase Master CPU utilization' check, even in HA mode. The check is therefore bound to
> the Nagios host, so that it is executed only once even if the HBase Master hostgroup has more
> than one host (the same is done for the "* Percent Count" alerts). As a result, the origin data
> of the HBase Master alert does not match any entry in /var/nagios/ignore.dat, and that is why
> the alert is not suppressed.
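
The mismatch can be illustrated with a small Python sketch. It rests on assumptions: the host names below are hypothetical (HBase Master on vm-0.vm, Nagios server on vm-3.vm), and it assumes the hostname is the field that differs, which is consistent with the issue reproducing only when the two run on different hosts. The real wrapper may build its lookup key differently.

```python
# Entries as they would appear in ignore.dat while HBase Master is in MM.
ignore_entries = {
    ("vm-0.vm", "HBASE", "HBASE_MASTER"),
    ("vm-0.vm", "HBASE", "HBASE_REGIONSERVER"),
}

HBASE_MASTER_HOST = "vm-0.vm"  # hypothetical: where HBase Master runs
NAGIOS_HOST = "vm-3.vm"        # hypothetical: where the Nagios server runs


def lookup_key(check_host: str) -> tuple:
    """Key used for the ignore.dat lookup; the host is the one the check is bound to."""
    return (check_host, "HBASE", "HBASE_MASTER")


# A check bound to the HBase Master host matches the MM entry.
bound_to_master = lookup_key(HBASE_MASTER_HOST) in ignore_entries

# The CPU utilization check is bound to the Nagios host instead,
# so its key is absent from the file and the alert is not suppressed.
bound_to_nagios = lookup_key(NAGIOS_HOST) in ignore_entries

print(bound_to_master, bound_to_nagios)
```

When the Nagios server and HBase Master share a host, both keys are identical, which would explain why the alert is suppressed correctly in that layout.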



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
