hadoop-common-user mailing list archives

From Alexandr Porunov <alexandr.poru...@gmail.com>
Subject Re: ZKFC do not work in Hadoop HA
Date Tue, 19 Jul 2016 17:01:29 GMT
Rakesh,

Thank you very much. I missed it. The "fuser" command was not installed on
my nodes. I've just installed it, and ZKFC now works properly!

Best regards,
Alexandr
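
The check itself can be scripted. Below is a minimal sketch, assuming
`fuser` ships in the `psmisc` package (true on most RHEL- and
Debian-family distributions); run it on each NameNode host as the user
the ZKFC connects as:

```shell
# Verify that fuser is visible on the PATH that sshfence will use over a
# non-interactive ssh session; the ZKFC runs roughly:
#   PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020
PATH="$PATH:/sbin:/usr/sbin"
if command -v fuser >/dev/null 2>&1; then
    MSG="fuser found at $(command -v fuser)"
else
    # psmisc provides fuser: 'yum install psmisc' or 'apt-get install psmisc'
    MSG="fuser missing: install the psmisc package"
fi
echo "$MSG"
```

Note that an interactive shell finding `fuser` is not enough: the
non-interactive ssh session used by sshfence may see a different PATH,
which is why the fencer prepends /sbin and /usr/sbin itself.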

On Tue, Jul 19, 2016 at 5:29 PM, Rakesh Radhakrishnan <rakeshr@apache.org>
wrote:

> Hi Alexandr,
>
> I can see the following warning message in your logs; it is the reason
> for the unsuccessful fencing. Could you please check whether the 'fuser'
> command is available on your system?
>
> 2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
> PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser:
> command not found
> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc:
> 127
> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Disconnecting from hadoopActiveMaster port 22
>
> Also, I'd suggest visiting
> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
> to understand more about the fencing logic. On that page you can
> search for the "*dfs.ha.fencing.methods*" configuration.
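
For reference, a typical fencing setup in hdfs-site.xml looks like the
fragment below. It is an illustrative sketch, not this cluster's actual
configuration (the key path is an assumption); it pairs sshfence with a
shell(/bin/true) fallback so that failover can still proceed when the dead
host is unreachable over SSH:

```xml
<!-- Illustrative hdfs-site.xml fragment; adjust the key path to your setup. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- Try sshfence first (requires fuser on the target host). If SSH
       itself fails because the host is down, fall back to shell(/bin/true)
       so the election is not blocked; with QJM the journal quorum already
       prevents a split-brain writer. -->
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>
```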
>
> Regards,
> Rakesh
>
> On Tue, Jul 19, 2016 at 7:22 PM, Alexandr Porunov <
> alexandr.porunov@gmail.com> wrote:
>
>> Hello,
>>
>> I have a problem with ZKFC.
>> I have configured High Availability for Hadoop with QJM.
>> The problem is that when I turn off the active master node (or kill the
>> namenode process), the standby node does not change its status from
>> standby to active. It just remains the standby node.
>>
>> I was watching the ZKFC log file when I turned off the active node. ZKFC
>> started trying to connect to the active node (which had already died) to
>> change its status from active to standby. But since that node was dead,
>> connecting to it was impossible.
>> Then I turned the active master node back on. After that my standby node
>> connected to the old active master node, changed its status from active
>> to standby, and changed its own status from standby to active.
>>
>> It is really strange: after the crash of the active node, ZKFC tries to
>> connect to the dead node, and until that connection succeeds it refuses
>> to change the standby node's status to active.
>>
>> Why does this happen?
>>
>> Here is my log from ZKFC (I cut it because it repeats all the time;
>> after this part the logger keeps writing the same thing):
>>
>> 2016-07-19 14:43:21,943 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Checking for any old active which needs to be fenced...
>> 2016-07-19 14:43:21,957 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Old node exists: 0a0a68612d636c757374657212036e6e311a12686164
>> 6f6f704163746976654d617374657220d43e28d33e
>> 2016-07-19 14:43:21,978 INFO org.apache.hadoop.ha.ZKFailoverController:
>> Should fence: NameNode at hadoopActiveMaster/192.168.0.80:8020
>> 2016-07-19 14:43:22,995 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect to server: hadoopActiveMaster/192.168.0.80:8020. Already tried 0
>> time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1,
>> sleepTime=1000 MILLISECONDS)
>> 2016-07-19 14:43:23,001 WARN org.apache.hadoop.ha.FailoverController:
>> Unable to gracefully make NameNode at hadoopActiveMaster/
>> 192.168.0.80:8020 standby (unable to connect)
>> java.net.ConnectException: Call From hadoopStandby/192.168.0.81 to
>> hadoopActiveMaster:8020 failed on connection exception:
>> java.net.ConnectException: Connection refused; For more details see:
>> http://wiki.apache.org/hadoop/ConnectionRefused
>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>>         at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>         at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>         at
>> org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>>         at
>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>>         at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
>>         at
>> org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
>>         at
>> org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
>> Caused by: java.net.ConnectException: Connection refused
>>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>         at
>> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>>         at
>> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
>>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
>>         at
>> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
>>         at
>> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
>>         at
>> org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
>>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1451)
>>         ... 14 more
>> 2016-07-19 14:43:23,007 INFO org.apache.hadoop.ha.NodeFencer: ======
>> Beginning Service Fencing Process... ======
>> 2016-07-19 14:43:23,007 INFO org.apache.hadoop.ha.NodeFencer: Trying
>> method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
>> 2016-07-19 14:43:23,064 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
>> Connecting to hadoopActiveMaster...
>> 2016-07-19 14:43:23,066 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Connecting to hadoopActiveMaster port 22
>> 2016-07-19 14:43:23,073 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Connection established
>> 2016-07-19 14:43:23,088 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Remote version string: SSH-2.0-OpenSSH_6.6.1
>> 2016-07-19 14:43:23,089 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Local version string: SSH-2.0-JSCH-0.1.42
>> 2016-07-19 14:43:23,089 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> CheckCiphers:
>> aes256-ctr,aes192-ctr,aes128-ctr,aes256-cbc,aes192-cbc,aes128-cbc,3des-ctr,arcfour,arcfour128,arcfour256
>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> aes256-ctr is not available.
>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> aes192-ctr is not available.
>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> aes256-cbc is not available.
>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> aes192-cbc is not available.
>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> arcfour256 is not available.
>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> SSH_MSG_KEXINIT sent
>> 2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> SSH_MSG_KEXINIT received
>> 2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> kex: server->client aes128-ctr hmac-md5 none
>> 2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> kex: client->server aes128-ctr hmac-md5 none
>> 2016-07-19 14:43:23,478 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> SSH_MSG_KEXDH_INIT sent
>> 2016-07-19 14:43:23,479 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> expecting SSH_MSG_KEXDH_REPLY
>> 2016-07-19 14:43:23,493 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> ssh_rsa_verify: signature true
>> 2016-07-19 14:43:23,495 WARN org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Permanently added 'hadoopActiveMaster' (RSA) to the list of known hosts.
>> 2016-07-19 14:43:23,495 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> SSH_MSG_NEWKEYS sent
>> 2016-07-19 14:43:23,495 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> SSH_MSG_NEWKEYS received
>> 2016-07-19 14:43:23,519 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> SSH_MSG_SERVICE_REQUEST sent
>> 2016-07-19 14:43:23,519 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> SSH_MSG_SERVICE_ACCEPT received
>> 2016-07-19 14:43:23,524 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Authentications that can continue:
>> gssapi-with-mic,publickey,keyboard-interactive,password
>> 2016-07-19 14:43:23,524 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Next authentication method: gssapi-with-mic
>> 2016-07-19 14:43:23,527 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Authentications that can continue: publickey,keyboard-interactive,password
>> 2016-07-19 14:43:23,527 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Next authentication method: publickey
>> 2016-07-19 14:43:23,617 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Authentication succeeded (publickey).
>> 2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
>> Connected to hadoopActiveMaster
>> 2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
>> Looking for process running on port 8020
>> 2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
>> PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser:
>> command not found
>> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc:
>> 127
>> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Disconnecting from hadoopActiveMaster port 22
>> 2016-07-19 14:43:23,717 WARN org.apache.hadoop.ha.NodeFencer: Fencing
>> method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
>> 2016-07-19 14:43:23,718 ERROR org.apache.hadoop.ha.NodeFencer: Unable to
>> fence service by any configured method.
>> 2016-07-19 14:43:23,719 WARN org.apache.hadoop.ha.ActiveStandbyElector:
>> Exception handling the winning of election
>> java.lang.RuntimeException: Unable to fence NameNode at
>> hadoopActiveMaster/192.168.0.80:8020
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
>> 2016-07-19 14:43:23,719 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Trying to re-establish ZK session
>> 2016-07-19 14:43:23,725 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Caught an exception, leaving main loop due to Socket closed
>> 2016-07-19 14:43:23,746 INFO org.apache.zookeeper.ZooKeeper: Session:
>> 0x35602bbb71e0002 closed
>> 2016-07-19 14:43:24,750 INFO org.apache.zookeeper.ZooKeeper: Initiating
>> client connection,
>> connectString=hadoopActiveMaster:2181,hadoopStandby:2181,hadoopSlave1:2181
>> sessionTimeout=5000
>> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@6a02f3d6
>> 2016-07-19 14:43:24,760 INFO org.apache.zookeeper.ClientCnxn: Opening
>> socket connection to server hadoopActiveMaster/192.168.0.80:2181. Will
>> not attempt to authenticate using SASL (unknown error)
>> 2016-07-19 14:43:24,762 INFO org.apache.zookeeper.ClientCnxn: Socket
>> connection established to hadoopActiveMaster/192.168.0.80:2181,
>> initiating session
>> 2016-07-19 14:43:24,773 INFO org.apache.zookeeper.ClientCnxn: Session
>> establishment complete on server hadoopActiveMaster/192.168.0.80:2181,
>> sessionid = 0x15602bba9e00003, negotiated timeout = 5000
>> 2016-07-19 14:43:24,778 INFO org.apache.zookeeper.ClientCnxn: EventThread
>> shut down
>> 2016-07-19 14:43:24,782 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Session connected.
>>
>>
>> Please help me solve this problem with the Hadoop HA configuration.
>>
>> Sincerely,
>> Alexandr
>>
>
>
