Date: Tue, 27 Oct 2015 06:09:27 +0000 (UTC)
From: "Hadoop QA (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-9311) Support optional offload of NameNode HA service health checks to a separate RPC server.

    [ https://issues.apache.org/jira/browse/HDFS-9311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14975788#comment-14975788 ]

Hadoop QA commented on HDFS-9311:
---------------------------------

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch | 20m 8s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 6 new or modified test files. |
| {color:green}+1{color} | javac | 9m 17s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 11m 47s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 29s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 1m 56s | The applied patch generated 1 new checkstyle issues (total was 12, now 12). |
| {color:red}-1{color} | whitespace | 0m 6s | The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install | 1m 53s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 37s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 4m 43s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | common tests | 9m 44s | Tests failed in hadoop-common. |
| {color:red}-1{color} | hdfs tests | 68m 49s | Tests failed in hadoop-hdfs. |
| | | 129m 55s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.ha.TestZKFailoverControllerStress |
| | hadoop.ha.TestZKFailoverController |
| | hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes |
| | hadoop.hdfs.server.blockmanagement.TestNodeCount |
| | hadoop.hdfs.util.TestByteArrayManager |
| | hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistFiles |
| Timed out tests | org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyWriter |
| | org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestInterDatanodeProtocol |
| | org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestScrLazyPersistFiles |
| | org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12768861/HDFS-9311.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 96677be |
| Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13210/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/13210/artifact/patchprocess/diffcheckstylehadoop-common.txt |
| whitespace | https://builds.apache.org/job/PreCommit-HDFS-Build/13210/artifact/patchprocess/whitespace.txt |
| hadoop-common test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13210/artifact/patchprocess/testrun_hadoop-common.txt |
| hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13210/artifact/patchprocess/testrun_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13210/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13210/console |


This message was automatically generated.

> Support optional offload of NameNode HA service health checks to a separate RPC server.
> ---------------------------------------------------------------------------------------
>
>                 Key: HDFS-9311
>                 URL: https://issues.apache.org/jira/browse/HDFS-9311
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha, namenode
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: HDFS-9311.001.patch
>
>
> When a NameNode is overwhelmed with load, it can lead to resource exhaustion of the RPC handler pools (both client-facing and service-facing). Eventually, this blocks the health check RPC issued from ZKFC, which triggers a failover. Depending on fencing configuration, the former active NameNode may be killed. In an overloaded situation, the new active NameNode is likely to suffer the same fate, because client load patterns don't change after the failover. This can degenerate into flapping between the 2 NameNodes without real recovery. If a NameNode had been killed by fencing, then it would have to transition through safe mode, further delaying time to recovery.
> This issue proposes a separate, optional RPC server at the NameNode for isolating the HA health checks. These health checks are lightweight operations that do not suffer from contention issues on the namesystem lock or other shared resources. Isolating the RPC handlers is sufficient to avoid this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
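
The handler-isolation argument in the issue description can be illustrated with a small, self-contained sketch. It is not taken from HDFS-9311.001.patch and does not use Hadoop's RPC classes. The class name `HandlerIsolationSketch`, the method `healthCheckAnswers`, the pool sizes, and the timeout are all hypothetical and chosen only for illustration. Plain `java.util.concurrent` thread pools stand in for RPC handler pools. When every client handler is blocked, a health check queued on the same pool times out, while one served by a dedicated pool still answers.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HandlerIsolationSketch {

    /**
     * Simulates one health check against a server whose client handlers are
     * all busy. Returns true if the check answers within timeoutMs.
     * (Hypothetical helper; thread pools stand in for RPC handler pools.)
     */
    static boolean healthCheckAnswers(boolean dedicatedPool, long timeoutMs)
            throws Exception {
        final int handlers = 4; // assumed handler count, for illustration only
        ExecutorService clientHandlers = Executors.newFixedThreadPool(handlers);
        // With isolation the health check gets its own single handler;
        // without it the check shares the saturated client pool.
        ExecutorService healthHandlers =
                dedicatedPool ? Executors.newFixedThreadPool(1) : clientHandlers;
        final CountDownLatch release = new CountDownLatch(1);
        try {
            // Saturate every client handler with a call that does not return
            // until the latch is released (models an overloaded server).
            for (int i = 0; i < handlers; i++) {
                clientHandlers.submit(() -> {
                    release.await();
                    return null;
                });
            }
            // The health check itself is trivial; it only needs a free handler.
            Future<Boolean> check = healthHandlers.submit(() -> Boolean.TRUE);
            try {
                return check.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // In the scenario described above, this is the point at which
                // the failover controller would treat the server as unhealthy.
                check.cancel(true);
                return false;
            }
        } finally {
            release.countDown();
            clientHandlers.shutdownNow();
            if (dedicatedPool) {
                healthHandlers.shutdownNow();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("shared pool answers:    " + healthCheckAnswers(false, 200));
        System.out.println("dedicated pool answers: " + healthCheckAnswers(true, 200));
    }
}
```

In this sketch the shared-pool check times out because it sits in the queue behind the blocked client calls, while the dedicated-pool check returns immediately. The sketch models only handler exhaustion. It says nothing about lock contention, which the description states is not a factor for these health checks.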