ambari-dev mailing list archives

From "Michael Harp" <michael.h...@teradata.com>
Subject Re: Review Request 21113: Add Nagios alert if HDFS last checkpoint time exceeds threshold
Date Tue, 06 May 2014 14:53:30 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/21113/#review42289
-----------------------------------------------------------


What's the expected behavior when NameNode HA is enabled?

- Michael Harp


On May 6, 2014, 2:44 p.m., Andrew Onischuk wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/21113/
> -----------------------------------------------------------
> 
> (Updated May 6, 2014, 2:44 p.m.)
> 
> 
> Review request for Ambari and Myroslav Papirkovskyy.
> 
> 
> Bugs: AMBARI-5681
>     https://issues.apache.org/jira/browse/AMBARI-5681
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> Description: If the Secondary NameNode (SNN) fails to merge edit files for any
> reason, Nagios doesn't alert on it.
> 
> PROBLEM: If the SNN fails to merge edit files for a long time, for any reason,
> the failure goes undetected. This can cause the edit files to become very large
> and slow down NameNode performance. In some cases, it can lead to corruption of
> the NameNode edit files.  
> BUSINESS IMPACT: If Nagios doesn't alert on SNN functionality, this will
> eventually cause long downtime for all customers and a possibility of data
> loss.
> 
> STEPS TO REPRODUCE:
> 
>   * SNN fails to merge edit files for any reason
>   * NameNode edit files grow in size
>   * NameNode edit files become corrupted
> 
> ACTUAL BEHAVIOR: Nagios doesn't fire critical alarm  
> EXPECTED BEHAVIOR: Nagios should fire critical alarm
> 
> SUPPORT ANALYSIS: N/A
> 
> Note:
> 
> We need to get this fixed and alert our customers to add the Nagios alarm
> ASAP.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/files/check_checkpoint_time.py PRE-CREATION
>   ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/scripts/nagios_server_config.py 4089b2e
>   ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/scripts/params.py 2e41c23
>   ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/templates/hadoop-commands.cfg.j2 ff03bf9
>   ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/templates/hadoop-services.cfg.j2 e7fda1a
>   ambari-server/src/test/python/stacks/2.0.6/NAGIOS/test_nagios_server.py 145b443
> 
> Diff: https://reviews.apache.org/r/21113/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Andrew Onischuk
> 
>
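
For readers without access to the diff, the kind of check the review request describes could look roughly like the sketch below. This is not the contents of check_checkpoint_time.py from the patch. It assumes the NameNode /jmx servlet response contains a bean with a LastCheckpointTime attribute (milliseconds since the epoch), and the function names and the warning/critical thresholds are hypothetical. It also does not address how the check should behave with NameNode HA, which the question above leaves open.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def extract_last_checkpoint_ms(jmx_json_text):
    """Return LastCheckpointTime (ms since epoch) from a /jmx response body.

    Assumption: the response is JSON of the form {"beans": [{...}, ...]} and
    one bean carries a numeric LastCheckpointTime. Returns None otherwise.
    """
    try:
        beans = json.loads(jmx_json_text).get("beans", [])
        for bean in beans:
            value = bean.get("LastCheckpointTime")
            if isinstance(value, (int, float)):
                return int(value)
    except (ValueError, TypeError, AttributeError):
        # Malformed JSON or an unexpected document shape.
        return None
    return None


def evaluate_checkpoint(last_checkpoint_ms, now_ms, warn_seconds, crit_seconds):
    """Map the age of the last checkpoint to a (Nagios exit code, message) pair."""
    if last_checkpoint_ms is None:
        return UNKNOWN, "UNKNOWN: LastCheckpointTime not available"
    # Clamp to zero in case of clock skew between hosts.
    age = max(0, (now_ms - last_checkpoint_ms) // 1000)
    if age >= crit_seconds:
        return CRITICAL, "CRITICAL: last checkpoint %d seconds ago" % age
    if age >= warn_seconds:
        return WARNING, "WARNING: last checkpoint %d seconds ago" % age
    return OK, "OK: last checkpoint %d seconds ago" % age
```

A wrapper script would fetch the /jmx document from the NameNode, pass the body and the current time to these functions, print the message, and exit with the returned code so that Nagios can raise the critical alarm requested under EXPECTED BEHAVIOR.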

