ambari-dev mailing list archives

From "Andrew Onischuk (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AMBARI-5681) Add Nagios alert if HDFS last checkpoint time exceeds threshold
Date Tue, 13 May 2014 17:54:15 GMT

    [ https://issues.apache.org/jira/browse/AMBARI-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996694#comment-13996694 ]

Andrew Onischuk commented on AMBARI-5681:
-----------------------------------------

Committed to branch-1.6.0

> Add Nagios alert if HDFS last checkpoint time exceeds threshold
> ---------------------------------------------------------------
>
>                 Key: AMBARI-5681
>                 URL: https://issues.apache.org/jira/browse/AMBARI-5681
>             Project: Ambari
>          Issue Type: Bug
>            Reporter: Andrew Onischuk
>            Assignee: Andrew Onischuk
>             Fix For: 1.6.0
>
>
> Description: If the secondary NameNode (SNN) fails to merge edit files for
> any reason, Nagios doesn't alert on it.
> PROBLEM: If, for any reason, the SNN fails to merge edit files for a long
> time, the failure goes undetected. This can cause the edit files to become
> very large and slow down NameNode performance. In some cases, it can lead to
> corruption of the NameNode edit files.
> BUSINESS IMPACT: If Nagios doesn't alert on SNN functionality, this will
> eventually cause long downtime for all customers and a possibility of data
> loss.
> STEPS TO REPRODUCE:
>   * SNN fails to merge edit files for any reason
>   * NameNode edit files grow in size
>   * Corruption to edit files.
> ACTUAL BEHAVIOR: Nagios doesn't fire a critical alarm
> EXPECTED BEHAVIOR: Nagios should fire a critical alarm
> SUPPORT ANALYSIS: N/A
> Note:
> We need to get this fixed and alert our customers to add the Nagios alarm
> ASAP.
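
The expected behavior described above (alert when the last checkpoint is older
than a threshold) could be sketched roughly as follows. This is only an
illustration, not the check committed for this issue: the function name
checkpoint_status and the threshold parameters are hypothetical, and obtaining
the last checkpoint time from the NameNode is left out. The only part taken as
given is the standard Nagios plugin exit-code convention (0 = OK, 1 = WARNING,
2 = CRITICAL).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def checkpoint_status(last_checkpoint, now, warn_after, crit_after):
    """Classify the age of the last HDFS checkpoint.

    All arguments are in seconds. The name and parameters are hypothetical;
    how last_checkpoint is obtained from the NameNode is outside this sketch.
    Returns a (nagios_exit_code, message) tuple.
    """
    age = now - last_checkpoint
    if age >= crit_after:
        return CRITICAL, "CRITICAL: last checkpoint was %d s ago" % age
    if age >= warn_after:
        return WARNING, "WARNING: last checkpoint was %d s ago" % age
    return OK, "OK: last checkpoint was %d s ago" % age


if __name__ == "__main__":
    import sys
    import time

    # Example values only: pretend the last checkpoint was 7 hours ago,
    # warn after 6 hours, go critical after 12 hours.
    current = int(time.time())
    code, message = checkpoint_status(current - 7 * 3600, current,
                                      6 * 3600, 12 * 3600)
    print(message)
    sys.exit(code)
```

A Nagios command definition would invoke such a script and use its exit code to
decide whether to raise the alarm; the 6-hour and 12-hour thresholds here are
placeholders to be tuned per cluster.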



--
This message was sent by Atlassian JIRA
(v6.2#6252)
