hadoop-hdfs-issues mailing list archives

From "Tianyin Xu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-7727) Check and verify the auto-fence settings to prevent failures of auto-failover
Date Tue, 03 Feb 2015 00:27:34 GMT
Tianyin Xu created HDFS-7727:
--------------------------------

             Summary: Check and verify the auto-fence settings to prevent failures of auto-failover
                 Key: HDFS-7727
                 URL: https://issues.apache.org/jira/browse/HDFS-7727
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: auto-failover
    Affects Versions: 2.5.1, 2.6.0, 2.4.1
            Reporter: Tianyin Xu


Sorry for reporting similar problems, but this problem resides in a different component, and
it has more severe consequences (this is my last report of this type of problem).


============================
Problem
-------------------------------------------------
The problem is similar to the following issues resolved in YARN,
https://issues.apache.org/jira/browse/YARN-2165
https://issues.apache.org/jira/browse/YARN-2166
and to the one I reported in the HDFS EditLogTailer,
https://issues.apache.org/jira/browse/HDFS-7726

Basically, the configuration settings are not checked and verified at initialization but are
parsed and applied directly at runtime. Any configuration error impairs the corresponding
component (since the exceptions are not caught).

In this case, the values are used in auto-failover, so you won't notice the errors until one
of the NameNodes fails and triggers the fencing procedure in the auto-failover process.

============================
Parameters
-------------------------------------------------

For SSH fencing, two configuration parameters are defined in SshFenceByTcpPort.java:
"dfs.ha.fencing.ssh.connect-timeout"
"dfs.ha.fencing.ssh.private-key-files"

They are used in the tryFence() function for auto-fencing.

Any erroneous setting of these two parameters results in uncaught exceptions that prevent
the fencing and impair auto-failover. We have verified this by setting up a two-NameNode
auto-failover cluster and manually killing the active NameNode. The passive (standby) NameNode
cannot take over successfully.

For "dfs.ha.fencing.ssh.connect-timeout", the erroneous settings include ill-formatted integers
and negative integers. The value is used for Thread.join(), which throws
IllegalArgumentException when given a negative timeout.
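
A minimal sketch of the kind of up-front check meant here, written against a plain
java.util.Map rather than Hadoop's Configuration class so that it runs standalone. The
class name, the method name, and the DEFAULT_MS fallback are hypothetical and are not taken
from SshFenceByTcpPort.java; only the parameter name comes from this report.

```java
import java.util.Map;

// Hypothetical helper: validates the connect timeout once, at initialization,
// instead of letting a bad value surface later inside the fencing path.
final class FenceTimeoutCheck {
    static final String KEY = "dfs.ha.fencing.ssh.connect-timeout";
    // Illustrative fallback only; not claimed to be the real Hadoop default.
    static final int DEFAULT_MS = 30_000;

    static int parseTimeoutMs(Map<String, String> conf) {
        String raw = conf.get(KEY);
        if (raw == null) {
            return DEFAULT_MS;
        }
        final int value;
        try {
            value = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            // Ill-formatted integer: fail with a message that names the key.
            throw new IllegalArgumentException(
                "Invalid value for " + KEY + ": '" + raw + "' is not an integer", e);
        }
        if (value < 0) {
            // Thread.join(long) rejects negative timeouts, so reject them here first.
            throw new IllegalArgumentException(
                "Invalid value for " + KEY + ": " + value + " must not be negative");
        }
        return value;
    }
}
```

Calling parseTimeoutMs() from the constructor would turn a typo such as "30s" or "-1" into
a startup error with a clear message, rather than a failed fence during failover.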

For "dfs.ha.fencing.ssh.private-key-files", the erroneous settings include a non-existent
private-key file path or wrong permissions, either of which makes jsch.addIdentity() fail in
the createSession() method.
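
A similar sketch for the key files, assuming the parameter holds a comma-separated list of
paths. The class and method names are hypothetical, and the check covers only existence and
readability by the current process; it does not reproduce whatever else jsch.addIdentity()
may reject (for example, a malformed key).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: verifies each configured private-key file at initialization.
final class FenceKeyFileCheck {
    static final String KEY = "dfs.ha.fencing.ssh.private-key-files";

    static List<Path> checkKeyFiles(String commaSeparated) {
        List<Path> keys = new ArrayList<>();
        if (commaSeparated == null) {
            return keys; // unset: nothing to verify in this sketch
        }
        for (String entry : commaSeparated.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas and whitespace
            }
            Path path = Paths.get(trimmed);
            if (!Files.isRegularFile(path)) {
                throw new IllegalArgumentException(
                    KEY + ": '" + trimmed + "' does not exist or is not a regular file");
            }
            if (!Files.isReadable(path)) {
                throw new IllegalArgumentException(
                    KEY + ": '" + trimmed + "' is not readable by this process");
            }
            keys.add(path);
        }
        return keys;
    }
}
```

Running this check when the fencer is constructed would report a bad path or unreadable key
as soon as the process starts, long before the active NameNode fails.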

I think actively checking the settings in the constructor of the class (in the same way as
YARN-2165, YARN-2166, HDFS-7726) should fix the problems.

Thanks! 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)