hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexandr Porunov <alexandr.poru...@gmail.com>
Subject ZKFC do not work in Hadoop HA
Date Tue, 19 Jul 2016 13:52:33 GMT
Hello,

I have a problem with ZKFC.
I have configured High Availability for Hadoop with QJM.
The problem is that when I turn off the active master node (or kill the
namenode process) standby node does not want to change its status from
standby to active. So it continues to be the standby node.

I was watching the log file of ZKFC when I turned off the active node. It
started trying to connect to the active node (which already died) to change
its status from active to standby.
But the active node already died, so it is impossible to connect to the
dead active master node.
Then I turned on the active master node. After that my standby node
connected to the old active master node and changed the status of the
active node from active to standby and the status of standby node from
standby to active.

It is really strange. After the crash of the active node the ZKFC wants to
connect to the dead node. Before connection is established ZKFC doesn't
want to change the status of standby node to active.

Why is it happens?

Here my log from zkfc (I cut it because it repeats all the time. After this
part of logs it logger writes the same thing):

2016-07-19 14:43:21,943 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Checking for any old active which needs to be fenced...
2016-07-19 14:43:21,957 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old
node exists: 0a0a68612d636c757374657212036e6e311a12686164
6f6f704163746976654d617374657220d43e28d33e
2016-07-19 14:43:21,978 INFO org.apache.hadoop.ha.ZKFailoverController:
Should fence: NameNode at hadoopActiveMaster/192.168.0.80:8020
2016-07-19 14:43:22,995 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: hadoopActiveMaster/192.168.0.80:8020. Already tried 0 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1,
sleepTime=1000 MILLISECONDS)
2016-07-19 14:43:23,001 WARN org.apache.hadoop.ha.FailoverController:
Unable to gracefully make NameNode at hadoopActiveMaster/192.168.0.80:8020
standby (unable to connect)
java.net.ConnectException: Call From hadoopStandby/192.168.0.81 to
hadoopActiveMaster:8020 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
        at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
        at org.apache.hadoop.ipc.Client.call(Client.java:1479)
        at org.apache.hadoop.ipc.Client.call(Client.java:1412)
        at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
        at
org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
        at
org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
        at
org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514)
        at
org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
        at
org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
        at
org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
        at
org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
        at
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
        at
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
        at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
        at
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
        at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
        at
org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
        at org.apache.hadoop.ipc.Client.call(Client.java:1451)
        ... 14 more
2016-07-19 14:43:23,007 INFO org.apache.hadoop.ha.NodeFencer: ======
Beginning Service Fencing Process... ======
2016-07-19 14:43:23,007 INFO org.apache.hadoop.ha.NodeFencer: Trying method
1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2016-07-19 14:43:23,064 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
Connecting to hadoopActiveMaster...
2016-07-19 14:43:23,066 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Connecting to hadoopActiveMaster port 22
2016-07-19 14:43:23,073 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Connection established
2016-07-19 14:43:23,088 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Remote version string: SSH-2.0-OpenSSH_6.6.1
2016-07-19 14:43:23,089 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Local version string: SSH-2.0-JSCH-0.1.42
2016-07-19 14:43:23,089 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
CheckCiphers:
aes256-ctr,aes192-ctr,aes128-ctr,aes256-cbc,aes192-cbc,aes128-cbc,3des-ctr,arcfour,arcfour128,arcfour256
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
aes256-ctr is not available.
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
aes192-ctr is not available.
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
aes256-cbc is not available.
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
aes192-cbc is not available.
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
arcfour256 is not available.
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_KEXINIT sent
2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_KEXINIT received
2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
kex: server->client aes128-ctr hmac-md5 none
2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
kex: client->server aes128-ctr hmac-md5 none
2016-07-19 14:43:23,478 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_KEXDH_INIT sent
2016-07-19 14:43:23,479 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
expecting SSH_MSG_KEXDH_REPLY
2016-07-19 14:43:23,493 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
ssh_rsa_verify: signature true
2016-07-19 14:43:23,495 WARN org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Permanently added 'hadoopActiveMaster' (RSA) to the list of known hosts.
2016-07-19 14:43:23,495 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_NEWKEYS sent
2016-07-19 14:43:23,495 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_NEWKEYS received
2016-07-19 14:43:23,519 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_SERVICE_REQUEST sent
2016-07-19 14:43:23,519 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_SERVICE_ACCEPT received
2016-07-19 14:43:23,524 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Authentications that can continue:
gssapi-with-mic,publickey,keyboard-interactive,password
2016-07-19 14:43:23,524 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Next authentication method: gssapi-with-mic
2016-07-19 14:43:23,527 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Authentications that can continue: publickey,keyboard-interactive,password
2016-07-19 14:43:23,527 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Next authentication method: publickey
2016-07-19 14:43:23,617 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Authentication succeeded (publickey).
2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
Connected to hadoopActiveMaster
2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
Looking for process running on port 8020
2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser:
command not found
2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc: 127
2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Disconnecting from hadoopActiveMaster port 22
2016-07-19 14:43:23,717 WARN org.apache.hadoop.ha.NodeFencer: Fencing
method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
2016-07-19 14:43:23,718 ERROR org.apache.hadoop.ha.NodeFencer: Unable to
fence service by any configured method.
2016-07-19 14:43:23,719 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Exception handling the winning of election
java.lang.RuntimeException: Unable to fence NameNode at hadoopActiveMaster/
192.168.0.80:8020
        at
org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
        at
org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
        at
org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
        at
org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
        at
org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
        at
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
        at
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
        at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-07-19 14:43:23,719 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Trying to re-establish ZK session
2016-07-19 14:43:23,725 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Caught an exception, leaving main loop due to Socket closed
2016-07-19 14:43:23,746 INFO org.apache.zookeeper.ZooKeeper: Session:
0x35602bbb71e0002 closed
2016-07-19 14:43:24,750 INFO org.apache.zookeeper.ZooKeeper: Initiating
client connection,
connectString=hadoopActiveMaster:2181,hadoopStandby:2181,hadoopSlave1:2181
sessionTimeout=5000
watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@6a02f3d6
2016-07-19 14:43:24,760 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server hadoopActiveMaster/192.168.0.80:2181. Will not
attempt to authenticate using SASL (unknown error)
2016-07-19 14:43:24,762 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to hadoopActiveMaster/192.168.0.80:2181, initiating
session
2016-07-19 14:43:24,773 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server hadoopActiveMaster/192.168.0.80:2181,
sessionid = 0x15602bba9e00003, negotiated timeout = 5000
2016-07-19 14:43:24,778 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down
2016-07-19 14:43:24,782 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Session connected.


Please, help me to solve the problem with the configuration of Hadoop HA

Sincerely,
Alexandr

Mime
View raw message