hadoop-hdfs-issues mailing list archives

From "Tianyin Xu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-7727) Check and verify the auto-fence settings to prevent failures of auto-failover
Date Tue, 03 Feb 2015 09:24:34 GMT

     [ https://issues.apache.org/jira/browse/HDFS-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Tianyin Xu updated HDFS-7727:
-----------------------------
    Attachment: check_config_SshFenceByTcpPort.1.patch

> Check and verify the auto-fence settings to prevent failures of auto-failover
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-7727
>                 URL: https://issues.apache.org/jira/browse/HDFS-7727
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>    Affects Versions: 2.4.1, 2.6.0, 2.5.1
>            Reporter: Tianyin Xu
>         Attachments: check_config_SshFenceByTcpPort.1.patch
>
>
> ============================
> Problem
> -------------------------------------------------
> Currently, the auto-failover feature of HDFS only checks the settings of the parameter
"dfs.ha.fencing.methods" but not the settings of the other "dfs.ha.fencing.*" parameters.
> Basically, the settings of the other "dfs.ha.fencing.*" parameters are not checked and verified
at initialization but are directly parsed and applied at runtime, so any configuration error
in them prevents the auto-failover.
> Because these values are only used when handling failures (auto-failover), the errors go
unnoticed until the active NameNode fails and triggers the fence procedure in the auto-failover
process.
> ============================
> Parameters
> -------------------------------------------------
> In SSHFence, there are two configuration parameters defined in SshFenceByTcpPort.java
> "dfs.ha.fencing.ssh.connect-timeout";
> "dfs.ha.fencing.ssh.private-key-files"
> They are used in the tryFence() function for auto-fencing. 
> Any erroneous settings of these two parameters result in uncaught exceptions that
prevent the fencing and the auto-failover. We verified this by setting up a two-NameNode
auto-failover cluster and manually killing the active NameNode: the standby NameNode cannot take over.
> For "dfs.ha.fencing.ssh.connect-timeout", erroneous settings include ill-formatted
integers and negative integers (the parsed value is passed to Thread.join(), which rejects negative timeouts).
> For "dfs.ha.fencing.ssh.private-key-files", erroneous settings include a non-existent
private-key file path or wrong file permissions, both of which make jsch.addIdentity() fail in the createSession()
method.
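> As a concrete illustration, the following standalone sketch (hypothetical class name, not HDFS code) reproduces the two unchecked exceptions described above: Integer.parseInt() fails on an ill-formatted timeout string, and Thread.join() rejects a negative timeout.

```java
// Standalone sketch (hypothetical, not HDFS code) of the two failure modes
// of erroneous "dfs.ha.fencing.ssh.connect-timeout" settings.
public class FenceConfigFailures {

    // Returns true if the timeout string cannot be parsed as an integer
    // (e.g. a value with a unit suffix such as "30s").
    static boolean parseFails(String value) {
        try {
            Integer.parseInt(value);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    // Returns true if Thread.join() rejects the given timeout;
    // join() throws IllegalArgumentException for negative values.
    static boolean joinFails(long timeoutMs) throws InterruptedException {
        Thread t = new Thread(() -> { });
        t.start();
        try {
            t.join(timeoutMs);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("ill-formatted rejected: " + parseFails("30s"));
        System.out.println("negative rejected:      " + joinFails(-1000));
    }
}
```

> Both exceptions are unchecked, which is why, without up-front validation, they only surface at fencing time.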
> The following shows one example of a failure caused by misconfiguring the "dfs.ha.fencing.ssh.private-key-files"
parameter.
> {code}
> 2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service
Fencing Process... ======
> 2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
> 2015-02-02 23:38:32,960 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to create
SSH session
> com.jcraft.jsch.JSchException: java.io.FileNotFoundException: /home/hadoop/.ssh/id_rsax
(No such file or directory)
>         at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:98)
>         at com.jcraft.jsch.JSch.addIdentity(JSch.java:206)
>         at com.jcraft.jsch.JSch.addIdentity(JSch.java:192)
>         at org.apache.hadoop.ha.SshFenceByTcpPort.createSession(SshFenceByTcpPort.java:122)
>         at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:91)
>         at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
>         at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
>         at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>         at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>         at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>         at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
>         at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
>         at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> Caused by: java.io.FileNotFoundException: /home/hadoop/.ssh/id_rsax (No such file or
directory)
>         at java.io.FileInputStream.open(Native Method)
>         at java.io.FileInputStream.<init>(FileInputStream.java:146)
>         at java.io.FileInputStream.<init>(FileInputStream.java:101)
>         at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:83)
>         ... 14 more
> {code}
> ============================
> Solution (the patch)
> -------------------------------------------------
> Check the configuration settings in the checkArgs() function. Currently, checkArgs() only
checks the settings of the parameter "dfs.ha.fencing.methods" but not the settings of the
other "dfs.ha.fencing.*" parameters.
> {code:title=SshFenceByTcpPort.java|borderStyle=solid}
>   /**
>    * Verify that the argument, if given, in the conf is parseable.
>    */
>   @Override
>   public void checkArgs(String argStr) throws BadFencingConfigurationException {
>     if (argStr != null) {
>       new Args(argStr);
>     }
>     <= Insert the checkers here (see the patch attached)
>   }
> {code}
> The detailed patch is shown below.
> {code}
> @@ -76,6 +77,23 @@
>      if (argStr != null) {
>        new Args(argStr);
>      }
> +
> +    //The configuration could be empty (e.g., called from DFSHAAdmin)
> +    if(getConf().size() > 0) {
> +      //check ssh.connect-timeout
> +      if(getSshConnectTimeout() <= 0)
> +        throw new BadFencingConfigurationException(
> +            CONF_CONNECT_TIMEOUT_KEY +
> +            " property value should be positive and non-zero");
> +
> +      //check the settings of dfs.ha.fencing.ssh.private-key-files
> +      for (String keyFilePath : getKeyFiles()) {
> +        File keyFile = new File(keyFilePath);
> +        if(!keyFile.isFile() || !keyFile.canRead())
> +            throw new BadFencingConfigurationException(
> +                "The configured private key file is invalid: " + keyFilePath);
> +      }
> +    }
>    }
>  
>    @Override
> {code}
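> For reference, here is a standalone sketch (hypothetical class name, not part of the patch itself) of the same validation logic: the connect timeout must be positive, and each configured private key must be a readable regular file.

```java
import java.io.File;

// Standalone sketch (hypothetical) mirroring the validation the patch
// adds to checkArgs() in SshFenceByTcpPort.
public class FenceConfigChecker {

    // Mirrors the timeout check: reject zero and negative values,
    // since they are later passed to Thread.join().
    static boolean isValidTimeout(int timeoutMs) {
        return timeoutMs > 0;
    }

    // Mirrors the key-file check: the path must point to a readable
    // regular file, otherwise jsch.addIdentity() would fail at fence time.
    static boolean isValidKeyFile(String keyFilePath) {
        File keyFile = new File(keyFilePath);
        return keyFile.isFile() && keyFile.canRead();
    }

    public static void main(String[] args) {
        System.out.println(isValidTimeout(30000));
        System.out.println(isValidKeyFile("/no/such/id_rsa"));
    }
}
```

> Performing these checks at initialization moves the failure from fence time (where it silently blocks the takeover) to startup (where the operator sees a BadFencingConfigurationException immediately).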
> Thanks! 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
