Date: Tue, 26 Aug 2014 00:06:58 +0000 (UTC)
From: "Hadoop QA (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-6184) Capture NN's thread dump when it fails over

    [ https://issues.apache.org/jira/browse/HDFS-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110022#comment-14110022 ]

Hadoop QA commented on HDFS-6184:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12664231/HDFS-6184.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

                  org.apache.hadoop.ipc.TestFairCallQueue
                  org.apache.hadoop.metrics2.impl.TestMetricsSystemImpl
                  org.apache.hadoop.ipc.TestIPC
                  org.apache.hadoop.ipc.TestCallQueueManager

                  The test build failed in hadoop-hdfs-project/hadoop-hdfs

    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7756//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7756//console

This message is automatically generated.

> Capture NN's thread dump when it fails over
> -------------------------------------------
>
>                 Key: HDFS-6184
>                 URL: https://issues.apache.org/jira/browse/HDFS-6184
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-6184.patch
>
>
> We have seen several false positives where ZKFC considers the NN to be unhealthy. Some of these trigger unnecessary failovers. Examples:
> 1. An SBN checkpoint caused ZKFC's RPC call into the NN to time out. The consequence isn't bad; the SBN just quits ZK membership and rejoins it later. But it is unnecessary. The reason is that the checkpoint acquires the NN global write lock, so all RPC requests are blocked. Even though HAServiceProtocol.monitorHealth doesn't need to acquire the NN lock, it still needs to use the service RPC queue.
> 2. When the ANN is busy, the global lock can sometimes block other requests, so ZKFC's RPC call times out. This triggers a failover. The problem is that even after the failover, the new ANN might run into a similar issue.
> We can increase the ZKFC-to-NN timeout value to mitigate this to some degree. It would be useful if ZKFC could judge more accurately whether the NN is healthy and predict whether a failover will help. For example, we can:
> 1. Have ZKFC make its decision based on the NN thread dump.
> 2. Have a dedicated RPC pool for ZKFC -> NN. Given that the health check doesn't need to acquire the NN global lock, it can go through even if the NN is checkpointing or very busy.
> Any comments?

--
This message was sent by Atlassian JIRA
(v6.2#6252)
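
For readers unfamiliar with what "capturing a thread dump" involves, here is a minimal sketch of taking one from inside a JVM with the standard `java.lang.management` API. It is not taken from HDFS-6184.patch, whose contents are not shown in this message; the class name `ThreadDumpSketch` and the method `dumpThreads` are hypothetical and used only for illustration. It also assumes the dump is taken in-process, whereas ZKFC would need some way to obtain it from the NN.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical illustration only; not the implementation from the patch.
public class ThreadDumpSketch {

    /** Returns a plain-text dump of all live threads in the current JVM. */
    public static String dumpThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Request lock details only when the JVM supports reporting them.
        ThreadInfo[] infos = mx.dumpAllThreads(
                mx.isObjectMonitorUsageSupported(),
                mx.isSynchronizerUsageSupported());

        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            sb.append('"').append(info.getThreadName()).append('"')
              .append(" id=").append(info.getThreadId())
              .append(" state=").append(info.getThreadState())
              .append('\n');
            // Show what the thread is blocked or waiting on, if anything.
            if (info.getLockName() != null) {
                sb.append("    waiting on ").append(info.getLockName());
                if (info.getLockOwnerName() != null) {
                    sb.append(" held by \"").append(info.getLockOwnerName()).append('"');
                }
                sb.append('\n');
            }
            // Print every frame; ThreadInfo.toString() would truncate the stack.
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpThreads());
    }
}
```

A dump like this records which thread holds a contended lock and which threads are waiting on it, which is the kind of evidence that could help distinguish a checkpoint holding the global write lock from a genuinely unhealthy NN.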