hadoop-mapreduce-user mailing list archives

From Alexandr Porunov <alexandr.poru...@gmail.com>
Subject ZKFC fencing problem after the active node crash
Date Tue, 19 Jul 2016 19:22:16 GMT
Hello,

I have configured a Hadoop HA cluster, and it works as described in the tutorials. If I kill
the NameNode process with "kill -9 NameNodeProcessId", my standby node
changes its state to active. But if I power off the active node, the standby
node can't change its state to active, because it keeps trying to connect to the
crashed node over SSH.

This parameter doesn't seem to take effect:
<property>
        <name>dfs.ha.fencing.ssh.connect-timeout</name>
        <value>3000</value>
</property>

The documentation says the default is 5 seconds, but even after 5 minutes the
standby node keeps trying to connect to the crashed node. I set the timeout
manually to 3 seconds, but it still doesn't work. So if we just kill the
NameNode process, the cluster keeps working, but if we crash the active node,
the cluster becomes unavailable.

*Here is the relevant part of the ZKFC log (after the crash, the same
messages repeat indefinitely)*:
2016-07-19 20:56:24,139 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
2016-07-19 20:56:24,139 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2016-07-19 20:56:24,141 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to hadoopActiveMaster...
2016-07-19 20:56:24,141 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to hadoopActiveMaster port 22
2016-07-19 20:56:27,148 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to connect to hadoopActiveMaster as user hadoop
com.jcraft.jsch.JSchException: timeout: socket is not established
        at com.jcraft.jsch.Util.createSocket(Util.java:386)
        at com.jcraft.jsch.Session.connect(Session.java:182)
        at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
        at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
        at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:532)
        at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
        at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
        at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
        at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-07-19 20:56:27,149 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
2016-07-19 20:56:27,149 ERROR org.apache.hadoop.ha.NodeFencer: Unable to fence service by any configured method.
2016-07-19 20:56:27,150 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
java.lang.RuntimeException: Unable to fence NameNode at hadoopActiveMaster/192.168.0.80:8020
        at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
        at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
        at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
        at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
        at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-07-19 20:56:27,150 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2016-07-19 20:56:27,177 INFO org.apache.zookeeper.ZooKeeper: Session: 0x3560443c6e30003 closed
2016-07-19 20:56:28,183 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hadoopActiveMaster:2181,hadoopStandby:2181,hadoopSlave1:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@d49b070
2016-07-19 20:56:28,186 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoopStandby/192.168.0.81:2181. Will not attempt to authenticate using SASL (unknown error)
2016-07-19 20:56:28,187 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoopStandby/192.168.0.81:2181, initiating session
2016-07-19 20:56:28,197 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hadoopStandby/192.168.0.81:2181, sessionid = 0x2560443c4670003, negotiated timeout = 5000
2016-07-19 20:56:28,199 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2016-07-19 20:56:28,203 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2016-07-19 20:56:28,207 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2016-07-19 20:56:28,210 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a096d79636c757374657212036d6e311a126861646f6f704163746976654d617374657220d43e28d33e
2016-07-19 20:56:28,213 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at hadoopActiveMaster/192.168.0.80:8020
2016-07-19 20:56:48,232 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoopActiveMaster/192.168.0.80:8020. Already tried 0 time(s); maxRetries=1
2016-07-19 20:57:08,242 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at hadoopActiveMaster/192.168.0.80:8020 standby (unable to connect)
org.apache.hadoop.net.ConnectTimeoutException: Call From hadoopStandby/192.168.0.81 to hadoopActiveMaster:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=hadoopActiveMaster/192.168.0.80:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
        at org.apache.hadoop.ipc.Client.call(Client.java:1479)
        at org.apache.hadoop.ipc.Client.call(Client.java:1412)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
        at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
        at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
        at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514)
        at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
        at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
        at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
        at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=hadoopActiveMaster/192.168.0.80:8020]
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
        at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
        at org.apache.hadoop.ipc.Client.call(Client.java:1451)
        ... 14 more
2016-07-19 20:57:08,243 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
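Judging by the timestamps above, each SSH attempt does give up after about 3 seconds (20:56:24,141 to 20:56:27,148), so the connect-timeout seems to apply per attempt while the fencing loop as a whole keeps retrying. A small self-contained Python sketch of that per-attempt behavior (the address 192.0.2.1 is a reserved TEST-NET address standing in for the powered-off node, not one of my hosts):

```python
import socket
import time

def probe(host, port, timeout_s):
    """Attempt one TCP connection with a bounded timeout, roughly what
    SshFenceByTcpPort does when it tries port 22 on the old active node."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, time.monotonic() - start
    except OSError:
        # Host unreachable or timed out: the attempt fails within timeout_s.
        return False, time.monotonic() - start

# Hypothetical unreachable address (RFC 5737 TEST-NET); the connect attempt
# fails, but nothing here stops a caller from retrying in a loop forever.
ok, elapsed = probe("192.0.2.1", 22, 1.0)
print(ok)
```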

*Here is my hdfs-site.xml*:
<configuration>
    <property>
        <name>dfs.nameservices</name>
        <value>mycluster</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.ha.namenodes.mycluster</name>
        <value>mn1,mn2</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.mycluster.mn1</name>
        <value>hadoopActiveMaster:8020</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.mycluster.mn2</name>
        <value>hadoopStandby:8020</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.mycluster.mn1</name>
        <value>hadoopActiveMaster:50070</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.mycluster.mn2</name>
        <value>hadoopStandby:50070</value>
    </property>
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>hadoopActiveMaster:2181,hadoopStandby:2181,hadoopSlave1:2181</value>
    </property>
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://hadoopActiveMaster:8485;hadoopStandby:8485;hadoopSlave1:8485/mycluster</value>
    </property>
    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence</value>
    </property>
    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/usr/hadoop/.ssh/id_rsa</value>
    </property>
    <property>
        <name>dfs.ha.fencing.ssh.connect-timeout</name>
        <value>3000</value>
    </property>
</configuration>
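If I understand the HA documentation correctly, dfs.ha.fencing.methods accepts a newline-separated list of methods that are tried in order, so a fallback after sshfence should let failover proceed when the old active is unreachable. This is only a sketch of what I think such a chain would look like (the shell(/bin/true) fallback is my assumption from the docs, not something I have tested on this cluster):

```xml
<property>
    <name>dfs.ha.fencing.methods</name>
    <!-- Try SSH fencing first; if it fails (e.g. the node is powered
         off), fall back to a shell method that always succeeds. -->
    <value>sshfence
shell(/bin/true)</value>
</property>
```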

*Here is my core-site.xml*:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster</value>
  </property>
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/var/hadoop/jn</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop/tmp</value>
  </property>
</configuration>

*Here is my zoo.cfg:*
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper/data
dataLogDir=/var/zookeeper/logs
clientPort=2181

server.1=hadoopActiveMaster:2888:3888
server.2=hadoopStandby:2888:3888
server.3=hadoopSlave1:2888:3888

*Here is my /etc/hosts:*
127.0.0.1   me
192.168.0.80 hadoopActiveMaster
192.168.0.81 hadoopStandby
192.168.0.82 hadoopSlave1
192.168.0.83 hadoopSlave2

Please help me solve this problem.

Sincerely,
Alexandr
