hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rakesh Radhakrishnan <rake...@apache.org>
Subject Re: ZKFC do not work in Hadoop HA
Date Tue, 19 Jul 2016 17:39:51 GMT
Good to hear the problem is resolved and able to continue.

Regards,
Rakesh

On Tue, Jul 19, 2016 at 10:31 PM, Alexandr Porunov <
alexandr.porunov@gmail.com> wrote:

> Rakesh,
>
> Thank you very much. I missed it. I hadn't "fuser" command on my nodes.
> I've just installed it. ZKFC became work properly!
>
> Best regards,
> Alexandr
>
> On Tue, Jul 19, 2016 at 5:29 PM, Rakesh Radhakrishnan <rakeshr@apache.org>
> wrote:
>
>> Hi Alexandr,
>>
>> I could see the following warning message in your logs and is the reason
>> for unsuccessful fencing. Could you please check 'fuser' command
>> execution in your system.
>>
>> 2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
>> PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser:
>> command not found
>> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc:
>> 127
>> 2016-07-19 14:43:23,706 INFO
>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Disconnecting from
>> hadoopActiveMaster port 22
>>
>> Also, I'd suggest to visit
>> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
>> page to understand more about the fencing logic. In this page you can
>> search for "*dfs.ha.fencing.methods*" configuration.
>>
>> Regards,
>> Rakesh
>>
>> On Tue, Jul 19, 2016 at 7:22 PM, Alexandr Porunov <
>> alexandr.porunov@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have a problem with ZKFC.
>>> I have configured High Availability for Hadoop with QJM.
>>> The problem is that when I turn off the active master node (or kill the
>>> namenode process) standby node does not want to change its status from
>>> standby to active. So it continues to be the standby node.
>>>
>>> I was watching the log file of ZKFC when I turned off the active node.
>>> It started trying to connect to the active node (which already died) to
>>> change its status from active to standby.
>>> But the active node already died, so it is impossible to connect to the
>>> dead active master node.
>>> Then I turned on the active master node. After that my standby node
>>> connected to the old active master node and changed the status of the
>>> active node from active to standby and the status of standby node from
>>> standby to active.
>>>
>>> It is really strange. After the crash of the active node the ZKFC wants
>>> to connect to the dead node. Before connection is established ZKFC doesn't
>>> want to change the status of standby node to active.
>>>
>>> Why is it happens?
>>>
>>> Here my log from zkfc (I cut it because it repeats all the time. After
>>> this part of logs it logger writes the same thing):
>>>
>>> 2016-07-19 14:43:21,943 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>>> Checking for any old active which needs to be fenced...
>>> 2016-07-19 14:43:21,957 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>>> Old node exists: 0a0a68612d636c757374657212036e6e311a12686164
>>> 6f6f704163746976654d617374657220d43e28d33e
>>> 2016-07-19 14:43:21,978 INFO org.apache.hadoop.ha.ZKFailoverController:
>>> Should fence: NameNode at hadoopActiveMaster/192.168.0.80:8020
>>> 2016-07-19 14:43:22,995 INFO org.apache.hadoop.ipc.Client: Retrying
>>> connect to server: hadoopActiveMaster/192.168.0.80:8020. Already tried
>>> 0 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000
>>> MILLISECONDS)
>>> 2016-07-19 14:43:23,001 WARN org.apache.hadoop.ha.FailoverController:
>>> Unable to gracefully make NameNode at hadoopActiveMaster/
>>> 192.168.0.80:8020 standby (unable to connect)
>>> java.net.ConnectException: Call From hadoopStandby/192.168.0.81 to
>>> hadoopActiveMaster:8020 failed on connection exception:
>>> java.net.ConnectException: Connection refused; For more details see:
>>> http://wiki.apache.org/hadoop/ConnectionRefused
>>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>> Method)
>>>         at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>>         at
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>         at
>>> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>>         at
>>> org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>>>         at
>>> org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
>>>         at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>>>         at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>>>         at
>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>>>         at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
>>>         at
>>> org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
>>>         at
>>> org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
>>>         at
>>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514)
>>>         at
>>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
>>>         at
>>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>>>         at
>>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
>>>         at
>>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
>>>         at
>>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
>>>         at
>>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
>>>         at
>>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>>>         at
>>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
>>> Caused by: java.net.ConnectException: Connection refused
>>>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>         at
>>> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>>>         at
>>> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
>>>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
>>>         at
>>> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
>>>         at
>>> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
>>>         at
>>> org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
>>>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
>>>         at org.apache.hadoop.ipc.Client.call(Client.java:1451)
>>>         ... 14 more
>>> 2016-07-19 14:43:23,007 INFO org.apache.hadoop.ha.NodeFencer: ======
>>> Beginning Service Fencing Process... ======
>>> 2016-07-19 14:43:23,007 INFO org.apache.hadoop.ha.NodeFencer: Trying
>>> method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
>>> 2016-07-19 14:43:23,064 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
>>> Connecting to hadoopActiveMaster...
>>> 2016-07-19 14:43:23,066 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to
>>> hadoopActiveMaster port 22
>>> 2016-07-19 14:43:23,073 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connection established
>>> 2016-07-19 14:43:23,088 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Remote version string:
>>> SSH-2.0-OpenSSH_6.6.1
>>> 2016-07-19 14:43:23,089 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Local version string:
>>> SSH-2.0-JSCH-0.1.42
>>> 2016-07-19 14:43:23,089 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: CheckCiphers:
>>> aes256-ctr,aes192-ctr,aes128-ctr,aes256-cbc,aes192-cbc,aes128-cbc,3des-ctr,arcfour,arcfour128,arcfour256
>>> 2016-07-19 14:43:23,445 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes256-ctr is not available.
>>> 2016-07-19 14:43:23,445 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes192-ctr is not available.
>>> 2016-07-19 14:43:23,445 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes256-cbc is not available.
>>> 2016-07-19 14:43:23,445 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes192-cbc is not available.
>>> 2016-07-19 14:43:23,445 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: arcfour256 is not available.
>>> 2016-07-19 14:43:23,445 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXINIT sent
>>> 2016-07-19 14:43:23,446 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXINIT received
>>> 2016-07-19 14:43:23,446 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex: server->client aes128-ctr
>>> hmac-md5 none
>>> 2016-07-19 14:43:23,446 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex: client->server aes128-ctr
>>> hmac-md5 none
>>> 2016-07-19 14:43:23,478 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXDH_INIT sent
>>> 2016-07-19 14:43:23,479 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: expecting SSH_MSG_KEXDH_REPLY
>>> 2016-07-19 14:43:23,493 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: ssh_rsa_verify: signature true
>>> 2016-07-19 14:43:23,495 WARN
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Permanently added
>>> 'hadoopActiveMaster' (RSA) to the list of known hosts.
>>> 2016-07-19 14:43:23,495 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_NEWKEYS sent
>>> 2016-07-19 14:43:23,495 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_NEWKEYS received
>>> 2016-07-19 14:43:23,519 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_SERVICE_REQUEST sent
>>> 2016-07-19 14:43:23,519 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_SERVICE_ACCEPT received
>>> 2016-07-19 14:43:23,524 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentications that can
>>> continue: gssapi-with-mic,publickey,keyboard-interactive,password
>>> 2016-07-19 14:43:23,524 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next authentication method:
>>> gssapi-with-mic
>>> 2016-07-19 14:43:23,527 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentications that can
>>> continue: publickey,keyboard-interactive,password
>>> 2016-07-19 14:43:23,527 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next authentication method:
>>> publickey
>>> 2016-07-19 14:43:23,617 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentication succeeded
>>> (publickey).
>>> 2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
>>> Connected to hadoopActiveMaster
>>> 2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
>>> Looking for process running on port 8020
>>> 2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
>>> PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser:
>>> command not found
>>> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc:
>>> 127
>>> 2016-07-19 14:43:23,706 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Disconnecting from
>>> hadoopActiveMaster port 22
>>> 2016-07-19 14:43:23,717 WARN org.apache.hadoop.ha.NodeFencer: Fencing
>>> method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
>>> 2016-07-19 14:43:23,718 ERROR org.apache.hadoop.ha.NodeFencer: Unable to
>>> fence service by any configured method.
>>> 2016-07-19 14:43:23,719 WARN org.apache.hadoop.ha.ActiveStandbyElector:
>>> Exception handling the winning of election
>>> java.lang.RuntimeException: Unable to fence NameNode at
>>> hadoopActiveMaster/192.168.0.80:8020
>>>         at
>>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
>>>         at
>>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
>>>         at
>>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>>>         at
>>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
>>>         at
>>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
>>>         at
>>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
>>>         at
>>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
>>>         at
>>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>>>         at
>>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
>>> 2016-07-19 14:43:23,719 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>>> Trying to re-establish ZK session
>>> 2016-07-19 14:43:23,725 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Caught an exception, leaving
>>> main loop due to Socket closed
>>> 2016-07-19 14:43:23,746 INFO org.apache.zookeeper.ZooKeeper: Session:
>>> 0x35602bbb71e0002 closed
>>> 2016-07-19 14:43:24,750 INFO org.apache.zookeeper.ZooKeeper: Initiating
>>> client connection,
>>> connectString=hadoopActiveMaster:2181,hadoopStandby:2181,hadoopSlave1:2181
>>> sessionTimeout=5000
>>> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@6a02f3d6
>>> 2016-07-19 14:43:24,760 INFO org.apache.zookeeper.ClientCnxn: Opening
>>> socket connection to server hadoopActiveMaster/192.168.0.80:2181. Will
>>> not attempt to authenticate using SASL (unknown error)
>>> 2016-07-19 14:43:24,762 INFO org.apache.zookeeper.ClientCnxn: Socket
>>> connection established to hadoopActiveMaster/192.168.0.80:2181,
>>> initiating session
>>> 2016-07-19 14:43:24,773 INFO org.apache.zookeeper.ClientCnxn: Session
>>> establishment complete on server hadoopActiveMaster/192.168.0.80:2181,
>>> sessionid = 0x15602bba9e00003, negotiated timeout = 5000
>>> 2016-07-19 14:43:24,778 INFO org.apache.zookeeper.ClientCnxn:
>>> EventThread shut down
>>> 2016-07-19 14:43:24,782 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>>> Session connected.
>>>
>>>
>>> Please, help me to solve the problem with the configuration of Hadoop HA
>>>
>>> Sincerely,
>>> Alexandr
>>>
>>
>>
>

Mime
View raw message