Rakesh,

Thank you very much. I missed it. I hadn't "fuser" command on my nodes.
I've just installed it. ZKFC became work properly!

Best regards,
Alexandr

On Tue, Jul 19, 2016 at 5:29 PM, Rakesh Radhakrishnan <rakeshr@apache.org> wrote:
Hi Alexandr,

I could see the following warning message in your logs and is the reason for unsuccessful fencing. Could you please check 'fuser' command execution in your system.

2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort: PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser: command not found
2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc: 127
2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Disconnecting from hadoopActiveMaster port 22

Also, I'd suggest to visit https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html page to understand more about the fencing logic. In this page you can search for "dfs.ha.fencing.methods" configuration.

Regards,
Rakesh

On Tue, Jul 19, 2016 at 7:22 PM, Alexandr Porunov <alexandr.porunov@gmail.com> wrote:
Hello, 

I have a problem with ZKFC.
I have configured High Availability for Hadoop with QJM.
The problem is that when I turn off the active master node (or kill the namenode process) standby node does not want to change its status from standby to active. So it continues to be the standby node.

I was watching the log file of ZKFC when I turned off the active node. It started trying to connect to the active node (which already died) to change its status from active to standby. 
But the active node already died, so it is impossible to connect to the dead active master node.
Then I turned on the active master node. After that my standby node connected to the old active master node and changed the status of the active node from active to standby and the status of standby node from standby to active.

It is really strange. After the crash of the active node the ZKFC wants to connect to the dead node. Before connection is established ZKFC doesn't want to change the status of standby node to active.

Why is it happens?

Here my log from zkfc (I cut it because it repeats all the time. After this part of logs it logger writes the same thing):

2016-07-19 14:43:21,943 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2016-07-19 14:43:21,957 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a0a68612d636c757374657212036e6e311a12686164
6f6f704163746976654d617374657220d43e28d33e
2016-07-19 14:43:21,978 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at hadoopActiveMaster/192.168.0.80:8020
2016-07-19 14:43:22,995 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoopActiveMaster/192.168.0.80:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2016-07-19 14:43:23,001 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at hadoopActiveMaster/192.168.0.80:8020 standby (unable to connect)
java.net.ConnectException: Call From hadoopStandby/192.168.0.81 to hadoopActiveMaster:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
        at org.apache.hadoop.ipc.Client.call(Client.java:1479)
        at org.apache.hadoop.ipc.Client.call(Client.java:1412)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
        at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
        at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
        at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514)
        at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
        at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
        at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
        at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
        at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
        at org.apache.hadoop.ipc.Client.call(Client.java:1451)
        ... 14 more
2016-07-19 14:43:23,007 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
2016-07-19 14:43:23,007 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2016-07-19 14:43:23,064 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to hadoopActiveMaster...
2016-07-19 14:43:23,066 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to hadoopActiveMaster port 22
2016-07-19 14:43:23,073 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connection established
2016-07-19 14:43:23,088 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Remote version string: SSH-2.0-OpenSSH_6.6.1
2016-07-19 14:43:23,089 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Local version string: SSH-2.0-JSCH-0.1.42
2016-07-19 14:43:23,089 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: CheckCiphers: aes256-ctr,aes192-ctr,aes128-ctr,aes256-cbc,aes192-cbc,aes128-cbc,3des-ctr,arcfour,arcfour128,arcfour256
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes256-ctr is not available.
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes192-ctr is not available.
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes256-cbc is not available.
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes192-cbc is not available.
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: arcfour256 is not available.
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXINIT sent
2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXINIT received
2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex: server->client aes128-ctr hmac-md5 none
2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex: client->server aes128-ctr hmac-md5 none
2016-07-19 14:43:23,478 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXDH_INIT sent
2016-07-19 14:43:23,479 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: expecting SSH_MSG_KEXDH_REPLY
2016-07-19 14:43:23,493 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: ssh_rsa_verify: signature true
2016-07-19 14:43:23,495 WARN org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Permanently added 'hadoopActiveMaster' (RSA) to the list of known hosts.
2016-07-19 14:43:23,495 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_NEWKEYS sent
2016-07-19 14:43:23,495 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_NEWKEYS received
2016-07-19 14:43:23,519 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_SERVICE_REQUEST sent
2016-07-19 14:43:23,519 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_SERVICE_ACCEPT received
2016-07-19 14:43:23,524 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentications that can continue: gssapi-with-mic,publickey,keyboard-interactive,password
2016-07-19 14:43:23,524 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next authentication method: gssapi-with-mic
2016-07-19 14:43:23,527 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentications that can continue: publickey,keyboard-interactive,password
2016-07-19 14:43:23,527 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next authentication method: publickey
2016-07-19 14:43:23,617 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentication succeeded (publickey).
2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connected to hadoopActiveMaster
2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Looking for process running on port 8020
2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort: PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser: command not found
2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc: 127
2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Disconnecting from hadoopActiveMaster port 22
2016-07-19 14:43:23,717 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
2016-07-19 14:43:23,718 ERROR org.apache.hadoop.ha.NodeFencer: Unable to fence service by any configured method.
2016-07-19 14:43:23,719 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
java.lang.RuntimeException: Unable to fence NameNode at hadoopActiveMaster/192.168.0.80:8020
        at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
        at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
        at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
        at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
        at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-07-19 14:43:23,719 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2016-07-19 14:43:23,725 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Caught an exception, leaving main loop due to Socket closed
2016-07-19 14:43:23,746 INFO org.apache.zookeeper.ZooKeeper: Session: 0x35602bbb71e0002 closed
2016-07-19 14:43:24,750 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hadoopActiveMaster:2181,hadoopStandby:2181,hadoopSlave1:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@6a02f3d6
2016-07-19 14:43:24,760 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoopActiveMaster/192.168.0.80:2181. Will not attempt to authenticate using SASL (unknown error)
2016-07-19 14:43:24,762 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoopActiveMaster/192.168.0.80:2181, initiating session
2016-07-19 14:43:24,773 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hadoopActiveMaster/192.168.0.80:2181, sessionid = 0x15602bba9e00003, negotiated timeout = 5000
2016-07-19 14:43:24,778 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2016-07-19 14:43:24,782 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.


Please, help me to solve the problem with the configuration of Hadoop HA

Sincerely,
Alexandr