From: Alexandr Porunov
Date: Tue, 19 Jul 2016 20:01:29 +0300
Subject: Re: ZKFC do not work in Hadoop HA
To: Rakesh Radhakrishnan
Cc: user@hadoop.apache.org
Rakesh,

Thank you very much. I missed it. I didn't have the "fuser" command on my nodes.
I've just installed it, and ZKFC now works properly!

Best regards,
Alexandr

On Tue, Jul 19, 2016 at 5:29 PM, Rakesh Radhakrishnan wrote:
> Hi Alexandr,
>
> I can see the following warning message in your logs; it is the reason
> for the unsuccessful fencing. Could you please check whether the 'fuser'
> command can be executed on your system?
>
> 2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
> PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser:
> command not found
> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc:
> 127
> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Disconnecting from hadoopActiveMaster port 22
>
> Also, I'd suggest visiting
> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
> to understand more about the fencing logic. On that page you can
> search for the "dfs.ha.fencing.methods" configuration.
>
> Regards,
> Rakesh
>
> On Tue, Jul 19, 2016 at 7:22 PM, Alexandr Porunov <
> alexandr.porunov@gmail.com> wrote:
>
>> Hello,
>>
>> I have a problem with ZKFC.
>> I have configured High Availability for Hadoop with QJM.
>> The problem is that when I turn off the active master node (or kill the
>> namenode process), the standby node does not change its status from
>> standby to active; it simply remains the standby node.
>>
>> I was watching the ZKFC log file when I turned off the active node. ZKFC
>> started trying to connect to the active node (which had already died) to
>> change its status from active to standby, but because that node was down
>> the connection could never be established.
>> Then I turned the active master node back on. After that my standby node
>> connected to the old active master node, changed that node's status from
>> active to standby, and changed its own status from standby to active.
>>
>> It is really strange: after the crash of the active node, ZKFC keeps
>> trying to connect to the dead node, and until that connection is
>> established it refuses to change the status of the standby node to
>> active.
>>
>> Why does this happen?
>>
>> Here is my ZKFC log (truncated, because it repeats; after this part the
>> logger keeps writing the same thing):
>>
>> 2016-07-19 14:43:21,943 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Checking for any old active which needs to be fenced...
>> 2016-07-19 14:43:21,957 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Old node exists: 0a0a68612d636c757374657212036e6e311a12686164
>> 6f6f704163746976654d617374657220d43e28d33e
>> 2016-07-19 14:43:21,978 INFO org.apache.hadoop.ha.ZKFailoverController:
>> Should fence: NameNode at hadoopActiveMaster/192.168.0.80:8020
>> 2016-07-19 14:43:22,995 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect to server: hadoopActiveMaster/192.168.0.80:8020.
Already tried 0 >> time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, >> sleepTime=1000 MILLISECONDS) >> 2016-07-19 14:43:23,001 WARN org.apache.hadoop.ha.FailoverController: >> Unable to gracefully make NameNode at hadoopActiveMaster/ >> 192.168.0.80:8020 standby (unable to connect) >> java.net.ConnectException: Call From hadoopStandby/192.168.0.81 to >> hadoopActiveMaster:8020 failed on connection exception: >> java.net.ConnectException: Connection refused; For more details see: >> http://wiki.apache.org/hadoop/ConnectionRefused >> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native >> Method) >> at >> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) >> at >> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) >> at java.lang.reflect.Constructor.newInstance(Constructor.java:423) >> at >> org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) >> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732) >> at org.apache.hadoop.ipc.Client.call(Client.java:1479) >> at org.apache.hadoop.ipc.Client.call(Client.java:1412) >> at >> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) >> at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source) >> at >> org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112) >> at >> org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172) >> at >> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514) >> at >> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505) >> at >> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61) >> at >> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892) >> at >> 
org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910) >> at >> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809) >> at >> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418) >> at >> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) >> at >> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) >> Caused by: java.net.ConnectException: Connection refused >> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) >> at >> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) >> at >> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) >> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531) >> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495) >> at >> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614) >> at >> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712) >> at >> org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) >> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528) >> at org.apache.hadoop.ipc.Client.call(Client.java:1451) >> ... 14 more >> 2016-07-19 14:43:23,007 INFO org.apache.hadoop.ha.NodeFencer: ====== >> Beginning Service Fencing Process... ====== >> 2016-07-19 14:43:23,007 INFO org.apache.hadoop.ha.NodeFencer: Trying >> method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null) >> 2016-07-19 14:43:23,064 INFO org.apache.hadoop.ha.SshFenceByTcpPort: >> Connecting to hadoopActiveMaster... 
>> 2016-07-19 14:43:23,066 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> Connecting to hadoopActiveMaster port 22 >> 2016-07-19 14:43:23,073 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> Connection established >> 2016-07-19 14:43:23,088 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> Remote version string: SSH-2.0-OpenSSH_6.6.1 >> 2016-07-19 14:43:23,089 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> Local version string: SSH-2.0-JSCH-0.1.42 >> 2016-07-19 14:43:23,089 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> CheckCiphers: >> aes256-ctr,aes192-ctr,aes128-ctr,aes256-cbc,aes192-cbc,aes128-cbc,3des-ctr,arcfour,arcfour128,arcfour256 >> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> aes256-ctr is not available. >> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> aes192-ctr is not available. >> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> aes256-cbc is not available. >> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> aes192-cbc is not available. >> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> arcfour256 is not available. 
>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> SSH_MSG_KEXINIT sent >> 2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> SSH_MSG_KEXINIT received >> 2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> kex: server->client aes128-ctr hmac-md5 none >> 2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> kex: client->server aes128-ctr hmac-md5 none >> 2016-07-19 14:43:23,478 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> SSH_MSG_KEXDH_INIT sent >> 2016-07-19 14:43:23,479 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> expecting SSH_MSG_KEXDH_REPLY >> 2016-07-19 14:43:23,493 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> ssh_rsa_verify: signature true >> 2016-07-19 14:43:23,495 WARN org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> Permanently added 'hadoopActiveMaster' (RSA) to the list of known hosts. >> 2016-07-19 14:43:23,495 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> SSH_MSG_NEWKEYS sent >> 2016-07-19 14:43:23,495 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> SSH_MSG_NEWKEYS received >> 2016-07-19 14:43:23,519 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> SSH_MSG_SERVICE_REQUEST sent >> 2016-07-19 14:43:23,519 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> SSH_MSG_SERVICE_ACCEPT received >> 2016-07-19 14:43:23,524 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> Authentications that can continue: >> gssapi-with-mic,publickey,keyboard-interactive,password >> 2016-07-19 14:43:23,524 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> Next authentication method: gssapi-with-mic >> 2016-07-19 14:43:23,527 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> Authentications that can continue: publickey,keyboard-interactive,password >> 2016-07-19 14:43:23,527 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> Next authentication method: publickey >> 2016-07-19 14:43:23,617 INFO 
org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> Authentication succeeded (publickey). >> 2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort: >> Connected to hadoopActiveMaster >> 2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort: >> Looking for process running on port 8020 >> 2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort: >> PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser: >> command not found >> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc: >> 127 >> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> Disconnecting from hadoopActiveMaster port 22 >> 2016-07-19 14:43:23,717 WARN org.apache.hadoop.ha.NodeFencer: Fencing >> method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful. >> 2016-07-19 14:43:23,718 ERROR org.apache.hadoop.ha.NodeFencer: Unable to >> fence service by any configured method. >> 2016-07-19 14:43:23,719 WARN org.apache.hadoop.ha.ActiveStandbyElector: >> Exception handling the winning of election >> java.lang.RuntimeException: Unable to fence NameNode at >> hadoopActiveMaster/192.168.0.80:8020 >> at >> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533) >> at >> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505) >> at >> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61) >> at >> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892) >> at >> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910) >> at >> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809) >> at >> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418) >> at >> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) >> at >> 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) >> 2016-07-19 14:43:23,719 INFO org.apache.hadoop.ha.ActiveStandbyElector: >> Trying to re-establish ZK session >> 2016-07-19 14:43:23,725 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: >> Caught an exception, leaving main loop due to Socket closed >> 2016-07-19 14:43:23,746 INFO org.apache.zookeeper.ZooKeeper: Session: >> 0x35602bbb71e0002 closed >> 2016-07-19 14:43:24,750 INFO org.apache.zookeeper.ZooKeeper: Initiating >> client connection, >> connectString=hadoopActiveMaster:2181,hadoopStandby:2181,hadoopSlave1:2181 >> sessionTimeout=5000 >> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@6a02f3d6 >> 2016-07-19 14:43:24,760 INFO org.apache.zookeeper.ClientCnxn: Opening >> socket connection to server hadoopActiveMaster/192.168.0.80:2181. Will >> not attempt to authenticate using SASL (unknown error) >> 2016-07-19 14:43:24,762 INFO org.apache.zookeeper.ClientCnxn: Socket >> connection established to hadoopActiveMaster/192.168.0.80:2181, >> initiating session >> 2016-07-19 14:43:24,773 INFO org.apache.zookeeper.ClientCnxn: Session >> establishment complete on server hadoopActiveMaster/192.168.0.80:2181, >> sessionid = 0x15602bba9e00003, negotiated timeout = 5000 >> 2016-07-19 14:43:24,778 INFO org.apache.zookeeper.ClientCnxn: EventThread >> shut down >> 2016-07-19 14:43:24,782 INFO org.apache.hadoop.ha.ActiveStandbyElector: >> Session connected. >> >> >> Please help me solve this problem with my Hadoop HA configuration. >> >> Sincerely, >> Alexandr
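The failure in this thread comes down to a single missing binary on the host being fenced: the sshfence method shells out to `fuser`, and when the command is absent the fence fails and ZKFC refuses to promote the standby. A small check run on each NameNode host can catch this before a real failover; the following is only a sketch, and the assumption that `fuser` ships in the `psmisc` package should be confirmed for your distribution:

```shell
#!/bin/sh
# Verify a command the sshfence method depends on is reachable,
# searching the same extra directories the fencer prepends to PATH
# (/sbin and /usr/sbin). Prints a hint if the command is missing.
check_dep() {
  PATH="$PATH:/sbin:/usr/sbin"
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
  else
    echo "$1 missing: install the package that provides it (psmisc on most Linux distributions, e.g. 'yum install psmisc' or 'apt-get install psmisc')"
  fi
}

# The fence command in the logs above was: fuser -v -k -n tcp 8020
check_dep fuser
```

The HDFSHighAvailabilityWithQJM page linked in the thread also documents listing more than one entry under `dfs.ha.fencing.methods`, so a `shell(...)` method can be configured as a fallback for cases where sshfence itself cannot complete (for example, when the failed host is entirely unreachable over ssh).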