hadoop-mapreduce-user mailing list archives

From Rakesh Radhakrishnan <rake...@apache.org>
Subject Re: ZKFC do not work in Hadoop HA
Date Tue, 19 Jul 2016 14:29:37 GMT
Hi Alexandr,

I can see the following warning message in your logs; it is the reason the
fencing is unsuccessful. Could you please check that the 'fuser' command is
available on your system?

2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser:
command not found
2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc: 127
2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Disconnecting from hadoopActiveMaster port 22
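
A quick way to confirm this on the old active host. This is only a sketch: it extends PATH the same way the fencer does, reproduces the "rc: 127" ("command not found") exit status locally with a deliberately missing command name, and notes where `fuser` usually comes from.

```shell
# The fencer extends PATH like this before invoking fuser on the remote host:
PATH=$PATH:/sbin:/usr/sbin

# "rc: 127" in the log is the shell's "command not found" exit status.
# Reproducing it locally with a deliberately nonexistent command:
bash -c 'fuser_stand_in_missing_command' 2>/dev/null
echo "rc: $?"

# If "command -v fuser" prints nothing, install the psmisc package
# (e.g. "yum install psmisc" or "apt-get install psmisc") on every
# NameNode host, since either node may need to fence the other.
```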

Also, I'd suggest visiting
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
to understand more about the fencing logic. On that page, search for the
"dfs.ha.fencing.methods" configuration.
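
For reference, a minimal fencing configuration in hdfs-site.xml might look like the sketch below. The key path and the shell(/bin/true) fallback are illustrative assumptions, not taken from your setup; with QJM, a shell(/bin/true) fallback is sometimes used because the JournalNodes themselves prevent a second writer.

```xml
<!-- Illustrative only; adjust the SSH user and key path to your cluster. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hadoop/.ssh/id_rsa</value>
</property>
```

Fencing methods in the value are newline-separated and tried in order until one succeeds.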

Regards,
Rakesh

On Tue, Jul 19, 2016 at 7:22 PM, Alexandr Porunov <
alexandr.porunov@gmail.com> wrote:

> Hello,
>
> I have a problem with ZKFC.
> I have configured High Availability for Hadoop with QJM.
> The problem is that when I turn off the active master node (or kill the
> namenode process), the standby node does not change its status from
> standby to active. It just remains the standby node.
>
> I was watching the ZKFC log file when I turned off the active node. ZKFC
> started trying to connect to the active node (which had already died) to
> change its status from active to standby. Because that node was dead, the
> connection could never be established.
> Then I turned the old active master node back on. After that my standby
> node connected to it, changed the old active node's status from active to
> standby, and changed the standby node's status from standby to active.
>
> It is really strange: after the crash of the active node, ZKFC keeps
> trying to connect to the dead node, and until that connection is
> established it does not change the standby node's status to active.
>
> Why does this happen?
>
> Here is my log from ZKFC (I cut it because it repeats all the time; after
> this part the logger keeps writing the same thing):
>
> 2016-07-19 14:43:21,943 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Checking for any old active which needs to be fenced...
> 2016-07-19 14:43:21,957 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Old node exists: 0a0a68612d636c757374657212036e6e311a12686164
> 6f6f704163746976654d617374657220d43e28d33e
> 2016-07-19 14:43:21,978 INFO org.apache.hadoop.ha.ZKFailoverController:
> Should fence: NameNode at hadoopActiveMaster/192.168.0.80:8020
> 2016-07-19 14:43:22,995 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: hadoopActiveMaster/192.168.0.80:8020. Already tried 0
> time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1,
> sleepTime=1000 MILLISECONDS)
> 2016-07-19 14:43:23,001 WARN org.apache.hadoop.ha.FailoverController:
> Unable to gracefully make NameNode at hadoopActiveMaster/192.168.0.80:8020
> standby (unable to connect)
> java.net.ConnectException: Call From hadoopStandby/192.168.0.81 to
> hadoopActiveMaster:8020 failed on connection exception:
> java.net.ConnectException: Connection refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at
> org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>         at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
>         at
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
>         at
> org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
>         at
> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514)
>         at
> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
>         at
> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>         at
> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
>         at
> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
>         at
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
>         at
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>         at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
>         at
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
>         at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
>         at
> org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1451)
>         ... 14 more
> 2016-07-19 14:43:23,007 INFO org.apache.hadoop.ha.NodeFencer: ======
> Beginning Service Fencing Process... ======
> 2016-07-19 14:43:23,007 INFO org.apache.hadoop.ha.NodeFencer: Trying
> method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
> 2016-07-19 14:43:23,064 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
> Connecting to hadoopActiveMaster...
> 2016-07-19 14:43:23,066 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Connecting to hadoopActiveMaster port 22
> 2016-07-19 14:43:23,073 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Connection established
> 2016-07-19 14:43:23,088 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Remote version string: SSH-2.0-OpenSSH_6.6.1
> 2016-07-19 14:43:23,089 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Local version string: SSH-2.0-JSCH-0.1.42
> 2016-07-19 14:43:23,089 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> CheckCiphers:
> aes256-ctr,aes192-ctr,aes128-ctr,aes256-cbc,aes192-cbc,aes128-cbc,3des-ctr,arcfour,arcfour128,arcfour256
> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> aes256-ctr is not available.
> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> aes192-ctr is not available.
> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> aes256-cbc is not available.
> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> aes192-cbc is not available.
> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> arcfour256 is not available.
> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> SSH_MSG_KEXINIT sent
> 2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> SSH_MSG_KEXINIT received
> 2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> kex: server->client aes128-ctr hmac-md5 none
> 2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> kex: client->server aes128-ctr hmac-md5 none
> 2016-07-19 14:43:23,478 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> SSH_MSG_KEXDH_INIT sent
> 2016-07-19 14:43:23,479 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> expecting SSH_MSG_KEXDH_REPLY
> 2016-07-19 14:43:23,493 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> ssh_rsa_verify: signature true
> 2016-07-19 14:43:23,495 WARN org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Permanently added 'hadoopActiveMaster' (RSA) to the list of known hosts.
> 2016-07-19 14:43:23,495 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> SSH_MSG_NEWKEYS sent
> 2016-07-19 14:43:23,495 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> SSH_MSG_NEWKEYS received
> 2016-07-19 14:43:23,519 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> SSH_MSG_SERVICE_REQUEST sent
> 2016-07-19 14:43:23,519 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> SSH_MSG_SERVICE_ACCEPT received
> 2016-07-19 14:43:23,524 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Authentications that can continue:
> gssapi-with-mic,publickey,keyboard-interactive,password
> 2016-07-19 14:43:23,524 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Next authentication method: gssapi-with-mic
> 2016-07-19 14:43:23,527 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Authentications that can continue: publickey,keyboard-interactive,password
> 2016-07-19 14:43:23,527 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Next authentication method: publickey
> 2016-07-19 14:43:23,617 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Authentication succeeded (publickey).
> 2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
> Connected to hadoopActiveMaster
> 2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
> Looking for process running on port 8020
> 2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
> PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser:
> command not found
> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc:
> 127
> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Disconnecting from hadoopActiveMaster port 22
> 2016-07-19 14:43:23,717 WARN org.apache.hadoop.ha.NodeFencer: Fencing
> method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
> 2016-07-19 14:43:23,718 ERROR org.apache.hadoop.ha.NodeFencer: Unable to
> fence service by any configured method.
> 2016-07-19 14:43:23,719 WARN org.apache.hadoop.ha.ActiveStandbyElector:
> Exception handling the winning of election
> java.lang.RuntimeException: Unable to fence NameNode at hadoopActiveMaster/
> 192.168.0.80:8020
>         at
> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
>         at
> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
>         at
> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>         at
> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
>         at
> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
>         at
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
>         at
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2016-07-19 14:43:23,719 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Trying to re-establish ZK session
> 2016-07-19 14:43:23,725 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Caught an exception, leaving main loop due to Socket closed
> 2016-07-19 14:43:23,746 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x35602bbb71e0002 closed
> 2016-07-19 14:43:24,750 INFO org.apache.zookeeper.ZooKeeper: Initiating
> client connection,
> connectString=hadoopActiveMaster:2181,hadoopStandby:2181,hadoopSlave1:2181
> sessionTimeout=5000
> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@6a02f3d6
> 2016-07-19 14:43:24,760 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server hadoopActiveMaster/192.168.0.80:2181. Will
> not attempt to authenticate using SASL (unknown error)
> 2016-07-19 14:43:24,762 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to hadoopActiveMaster/192.168.0.80:2181,
> initiating session
> 2016-07-19 14:43:24,773 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server hadoopActiveMaster/192.168.0.80:2181,
> sessionid = 0x15602bba9e00003, negotiated timeout = 5000
> 2016-07-19 14:43:24,778 INFO org.apache.zookeeper.ClientCnxn: EventThread
> shut down
> 2016-07-19 14:43:24,782 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Session connected.
>
>
> Please help me solve this problem with my Hadoop HA configuration.
>
> Sincerely,
> Alexandr
>
