Return-Path:
X-Original-To: apmail-ambari-dev-archive@www.apache.org
Delivered-To: apmail-ambari-dev-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 1618F1128A for ; Tue, 6 May 2014 15:02:51 +0000 (UTC)
Received: (qmail 90023 invoked by uid 500); 6 May 2014 14:44:25 -0000
Delivered-To: apmail-ambari-dev-archive@ambari.apache.org
Received: (qmail 90008 invoked by uid 500); 6 May 2014 14:44:25 -0000
Mailing-List: contact dev-help@ambari.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@ambari.apache.org
Delivered-To: mailing list dev@ambari.apache.org
Received: (qmail 89996 invoked by uid 99); 6 May 2014 14:44:25 -0000
Received: from reviews-vm.apache.org (HELO reviews.apache.org) (140.211.11.40) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 06 May 2014 14:44:25 +0000
Received: from reviews.apache.org (localhost [127.0.0.1]) by reviews.apache.org (Postfix) with ESMTP id E0A7C1D7591; Tue, 6 May 2014 14:44:18 +0000 (UTC)
Content-Type: multipart/alternative; boundary="===============2730746174703618480=="
MIME-Version: 1.0
Subject: Review Request 21113: Add Nagios alert if HDFS last checkpoint time exceeds threshold
From: "Andrew Onischuk"
To: "Myroslav Papirkovskyy"
Cc: "Andrew Onischuk", "Ambari"
Date: Tue, 06 May 2014 14:44:18 -0000
Message-ID: <20140506144418.7613.25937@reviews.apache.org>
X-ReviewBoard-URL: https://reviews.apache.org
Auto-Submitted: auto-generated
Sender: "Andrew Onischuk"
X-ReviewGroup: Ambari
X-ReviewRequest-URL: https://reviews.apache.org/r/21113/
X-Sender: "Andrew Onischuk"
Reply-To: "Andrew Onischuk"
X-ReviewRequest-Repository: ambari

--===============2730746174703618480==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

-----------------------------------------------------------
This is an automatically generated e-mail.
To reply, visit: https://reviews.apache.org/r/21113/
-----------------------------------------------------------

Review request for Ambari and Myroslav Papirkovskyy.


Bugs: AMBARI-5681
    https://issues.apache.org/jira/browse/AMBARI-5681


Repository: ambari


Description
-------

Description: If the Secondary NameNode (SNN) fails to merge edit files for any reason, Nagios does not alert on it.

PROBLEM: If the SNN fails to merge edit files for a long time, for any reason, the failure goes undetected. The edit files can then grow very large and slow down NameNode performance. In some cases this can lead to corruption of the NameNode edit files.

BUSINESS IMPACT: If Nagios does not alert on SNN failures, this will eventually cause long downtime for all customers and a possibility of data loss.

STEPS TO REPRODUCE:
* SNN fails to merge edit files for any reason
* NameNode edit files grow in size
* Edit files become corrupted

ACTUAL BEHAVIOR: Nagios does not fire a critical alarm

EXPECTED BEHAVIOR: Nagios should fire a critical alarm

SUPPORT ANALYSIS: N/A

Note: We need to get this fixed and alert our customers to add the Nagios alarm ASAP.


Diffs
-----

  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/files/check_checkpoint_time.py PRE-CREATION
  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/scripts/nagios_server_config.py 4089b2e
  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/scripts/params.py 2e41c23
  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/templates/hadoop-commands.cfg.j2 ff03bf9
  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/templates/hadoop-services.cfg.j2 e7fda1a
  ambari-server/src/test/python/stacks/2.0.6/NAGIOS/test_nagios_server.py 145b443

Diff: https://reviews.apache.org/r/21113/diff/


Testing
-------


Thanks,

Andrew Onischuk

--===============2730746174703618480==--
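
For readers without access to the diff, the expected behavior described above (alert when the last checkpoint is older than a threshold) could be sketched roughly as follows. This is not the content of check_checkpoint_time.py from the review; the function name, the threshold parameters, and the assumption that the last checkpoint time arrives as milliseconds since the epoch are all hypothetical. Only the exit codes follow the standard Nagios plugin convention.

```python
# Illustrative sketch only; NOT the check_checkpoint_time.py from this review.
# Assumption: the caller already has the last checkpoint time in epoch
# milliseconds (how it is fetched from the NameNode is out of scope here).

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def checkpoint_status(last_checkpoint_ms, now_ms, warn_seconds, crit_seconds):
    """Return (exit_code, message) based on the age of the last checkpoint.

    warn_seconds and crit_seconds are hypothetical thresholds chosen by the
    operator; crit_seconds is expected to be the larger of the two.
    """
    if last_checkpoint_ms is None or last_checkpoint_ms <= 0:
        return UNKNOWN, "UNKNOWN: last checkpoint time is not available"

    # Age of the last successful checkpoint, in whole seconds.
    age_seconds = (now_ms - last_checkpoint_ms) // 1000

    if age_seconds >= crit_seconds:
        return CRITICAL, "CRITICAL: last checkpoint was %d seconds ago" % age_seconds
    if age_seconds >= warn_seconds:
        return WARNING, "WARNING: last checkpoint was %d seconds ago" % age_seconds
    return OK, "OK: last checkpoint was %d seconds ago" % age_seconds
```

A Nagios command wrapper would typically print the returned message and exit with the returned code, so that a stalled SNN surfaces as a critical alarm instead of going undetected.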