Date: Tue, 3 Feb 2015 00:27:34 +0000 (UTC)
From: "Tianyin Xu (JIRA)"
To: hdfs-dev@hadoop.apache.org
Subject: [jira] [Created] (HDFS-7727) Check and verify the auto-fence settings to prevent failures of auto-failover

Tianyin Xu created HDFS-7727:
--------------------------------

             Summary: Check and verify the auto-fence settings to prevent failures of auto-failover
                 Key: HDFS-7727
                 URL: https://issues.apache.org/jira/browse/HDFS-7727
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: auto-failover
    Affects Versions: 2.5.1, 2.6.0, 2.4.1
            Reporter: Tianyin Xu

Sorry for reporting similar problems, but these problems reside in different components, and this one has more severe consequences (this is my last report of this type of problem).
============================
Problem
-------------------------------------------------
The problem is similar to the following issues resolved in YARN,
https://issues.apache.org/jira/browse/YARN-2165
https://issues.apache.org/jira/browse/YARN-2166

and to the one I reported in the HDFS EditLogTailer,
https://issues.apache.org/jira/browse/HDFS-7726

Basically, the configuration settings are not checked and verified at initialization; they are parsed and applied directly at runtime. Any configuration error impairs the corresponding component, because the resulting exceptions are not caught.

In this case the values are used only in auto-failover, so you won't notice the errors until one of the NameNodes fails and triggers the fencing procedure in the auto-failover process.

============================
Parameters
-------------------------------------------------
In SSHFence, two configuration parameters are defined in SshFenceByTcpPort.java:
"dfs.ha.fencing.ssh.connect-timeout";
"dfs.ha.fencing.ssh.private-key-files"

They are used in the tryFence() function for auto-fencing.

Any erroneous setting of these two parameters results in an uncaught exception that prevents fencing and impairs auto-failover. We have verified this by setting up a two-NameNode auto-failover cluster and manually killing the active NameNode. The passive NameNode cannot take over successfully.

For "dfs.ha.fencing.ssh.connect-timeout", the erroneous settings include ill-formatted integers and negative integers (the value is passed to Thread.join()).

For "dfs.ha.fencing.ssh.private-key-files", the erroneous settings include a non-existent private-key file path or wrong file permissions, either of which makes jsch.addIdentity() fail in the createSession() method.

I think actively checking the settings in the constructor of the class (in the same way as YARN-2165, YARN-2166, HDFS-7726) should be able to fix the problems.

Thanks!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
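
For illustration only, the constructor-time check proposed above could look roughly like the sketch below. It is not the actual patch and not Hadoop code: the class SshFenceSettingsCheck, its method names, the use of a plain Map in place of Hadoop's Configuration, the 30-second default, and the comma-separated key-file list are all assumptions made to keep the example self-contained. Only the two configuration keys are taken from the report.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Hadoop): validates the SSH fencing settings up front. */
final class SshFenceSettingsCheck {
    // The two keys named in the report.
    static final String CONNECT_TIMEOUT_KEY = "dfs.ha.fencing.ssh.connect-timeout";
    static final String PRIVATE_KEY_FILES_KEY = "dfs.ha.fencing.ssh.private-key-files";
    // Assumed default, used only for this illustration.
    static final int ASSUMED_DEFAULT_TIMEOUT_MS = 30_000;

    /** Returns the timeout in ms, or throws if it is ill-formatted or negative. */
    static int checkConnectTimeout(Map<String, String> conf) {
        String raw = conf.get(CONNECT_TIMEOUT_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return ASSUMED_DEFAULT_TIMEOUT_MS;
        }
        int timeout;
        try {
            timeout = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                CONNECT_TIMEOUT_KEY + " is not an integer: '" + raw + "'", e);
        }
        if (timeout < 0) {
            // Thread.join(long) rejects negative values, so fail early instead.
            throw new IllegalArgumentException(
                CONNECT_TIMEOUT_KEY + " must not be negative: " + timeout);
        }
        return timeout;
    }

    /** Returns the key files, or throws if one is missing or unreadable. */
    static List<Path> checkPrivateKeyFiles(Map<String, String> conf) {
        List<Path> keys = new ArrayList<>();
        String raw = conf.get(PRIVATE_KEY_FILES_KEY);
        if (raw == null) {
            return keys;
        }
        // Assumption: the value is a comma-separated list of paths.
        for (String item : raw.split(",")) {
            String trimmed = item.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            Path p = Paths.get(trimmed);
            if (!Files.isRegularFile(p) || !Files.isReadable(p)) {
                throw new IllegalArgumentException(
                    PRIVATE_KEY_FILES_KEY + ": missing or unreadable key file '" + trimmed + "'");
            }
            keys.add(p);
        }
        return keys;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(CONNECT_TIMEOUT_KEY, "-5");
        try {
            checkConnectTimeout(conf);
        } catch (IllegalArgumentException e) {
            // The misconfiguration is reported at startup, not during a failover.
            System.out.println("Rejected at initialization: " + e.getMessage());
        }
    }
}
```

Calling both checks from the fencer's constructor would surface a bad timeout or key path when the NameNode starts, rather than when the active NameNode has already failed and tryFence() is running.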