Date: Wed, 28 Oct 2015 12:38:28 +0000 (UTC)
From: "Hudson (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-9311) Support optional offload of NameNode HA service health checks to a separate RPC server.


    [ https://issues.apache.org/jira/browse/HDFS-9311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978342#comment-14978342 ]

Hudson commented on HDFS-9311:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #545 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/545/])
HDFS-9311.
Support optional offload of NameNode HA service health checks (cnauroth: rev bf8e45298218f70e38838152f69c7705d8606bd6)
* hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestHealthMonitorWithDedicatedHealthAddress.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeRespectsBindHostKeys.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java
* hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestHealthMonitor.java
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HAServiceTarget.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSUtil.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/NNHAServiceTarget.java
* hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/DummyHAService.java
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HealthMonitor.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestNNHealthCheck.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java


> Support optional offload of NameNode HA service health checks to a separate RPC server.
> ---------------------------------------------------------------------------------------
>
>                 Key: HDFS-9311
>                 URL: https://issues.apache.org/jira/browse/HDFS-9311
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ha, namenode
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>             Fix For: 2.8.0
>
>         Attachments: HDFS-9311.001.patch, HDFS-9311.002.patch, HDFS-9311.003.patch
>
>
> When a NameNode is overwhelmed with load, it can lead to resource exhaustion of the RPC handler pools (both client-facing and service-facing). Eventually, this blocks the health check RPC issued from ZKFC, which triggers a failover. Depending on fencing configuration, the former active NameNode may be killed. In an overloaded situation, the new active NameNode is likely to suffer the same fate, because client load patterns don't change after the failover. This can degenerate into flapping between the 2 NameNodes without real recovery. If a NameNode had been killed by fencing, then it would have to transition through safe mode, further delaying time to recovery.
> This issue proposes a separate, optional RPC server at the NameNode for isolating the HA health checks. These health checks are lightweight operations that do not suffer from contention issues on the namesystem lock or other shared resources. Isolating the RPC handlers is sufficient to avoid this situation.
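
The handler-isolation argument in the issue description can be illustrated with a small, self-contained Java sketch. This is not code from the patch and uses no Hadoop APIs: two plain thread pools stand in for the main RPC handler pool and the proposed dedicated health-check server, and every name here (HealthCheckIsolation, probe, saturate, demo) is hypothetical. Assuming a fixed-size handler pool whose threads are all blocked by client load, a health probe queued on that pool times out, while the same probe on a separate pool still answers.

```java
import java.util.concurrent.*;

// Hypothetical illustration only: thread pools stand in for RPC handler pools.
final class HealthCheckIsolation {

    /** Submits a trivial health probe; returns true only if it answers within timeoutMs. */
    static boolean probe(ExecutorService pool, long timeoutMs) {
        Future<Boolean> f = pool.submit(() -> true);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            f.cancel(true);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            f.cancel(true);
            return false;
        }
    }

    /** Occupies every handler thread in the pool until the latch is released. */
    static void saturate(ExecutorService pool, int handlers, CountDownLatch release) {
        for (int i = 0; i < handlers; i++) {
            pool.submit(() -> { release.await(); return null; });
        }
    }

    /** Returns { probe answered on the shared pool, probe answered on the dedicated pool }. */
    static boolean[] demo() {
        int handlers = 4;
        ExecutorService mainPool = Executors.newFixedThreadPool(handlers);
        ExecutorService healthPool = Executors.newFixedThreadPool(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            saturate(mainPool, handlers, release);      // simulated client overload
            boolean shared = probe(mainPool, 1000);     // queued behind blocked handlers
            boolean isolated = probe(healthPool, 1000); // served by its own handler
            return new boolean[] { shared, isolated };
        } finally {
            release.countDown();
            mainPool.shutdownNow();
            healthPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("shared pool answered: " + r[0]);
        System.out.println("dedicated pool answered: " + r[1]);
    }
}
```

In this sketch the timeout on the shared pool plays the role of the blocked ZKFC health check that would trigger a failover; the dedicated pool avoids it because the probe never competes with client work for a handler.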