hadoop-user mailing list archives

From Rakesh Radhakrishnan <rake...@apache.org>
Subject Re: ZKFC fencing problem after the active node crash
Date Wed, 20 Jul 2016 03:02:25 GMT
Hi Alexandr,

Since you powered off the active NN machine, during failover the standby NN's
SSH fencing attempt timed out connecting to that machine, so fencing failed.
Fencing methods are normally configured to ensure that multiple writers can
never use the same shared storage. You are using QJM, which supports this
guarantee on its own, i.e. the JournalNodes will not allow more than one
writer at a time, so an external fencing method can be skipped for
correctness. AFAIK, to improve the availability of the system in the event
that the fencing mechanisms fail, it is advisable to configure a fencing
method which is guaranteed to return success. You can remove the SSH fencing
method from both machines' configurations and try the shell-based fence
method below instead, then restart the cluster. Failover should then happen
successfully.

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>
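
Alternatively, if you prefer to keep attempting the SSH fence first, the HA
documentation also allows listing multiple fencing methods in the value,
separated by newlines and tried in order, with shell(/bin/true) as a
guaranteed-success fallback. A sketch (the key path stays whatever you
already configured in dfs.ha.fencing.ssh.private-key-files):

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>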

*Reference:*
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
"*JournalNodes will only ever allow a single NameNode to be a writer at a
time. During a failover, the NameNode which is to become active will simply
take over the role of writing to the JournalNodes, which will effectively
prevent the other NameNode from continuing in the Active state, allowing
the new Active to safely proceed with failover*."
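
By the way, the difference you observe between "kill -9" and powering off
the machine comes down to TCP behaviour, which a small Python sketch can
illustrate (illustrative only, not Hadoop code; 127.0.0.1 and the ephemeral
port are stand-ins for the old active's address):

```python
import socket
import time

# Find a local TCP port that is closed (no listener) on a host that is
# up -- this models "NameNode process killed, machine still running".
probe = socket.socket()
probe.bind(("127.0.0.1", 0))   # kernel assigns a free port
port = probe.getsockname()[1]
probe.close()                  # now nothing is listening on that port

start = time.monotonic()
try:
    conn = socket.create_connection(("127.0.0.1", port), timeout=3)
    conn.close()
    refused = False
except ConnectionRefusedError:
    # A live host answers immediately with a TCP RST, so the connection
    # fails instantly and sshfence can report back right away.
    refused = True
elapsed = time.monotonic() - start

# A powered-off host, by contrast, sends nothing back, so each connect
# attempt blocks for the full timeout and sshfence keeps failing.
print(refused, elapsed < 1.0)  # → True True
```

This is why killing the process fails over cleanly while a power-off leaves
the standby retrying: no host is left to refuse the connection quickly.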

Regards,
Rakesh

On Wed, Jul 20, 2016 at 12:52 AM, Alexandr Porunov <
alexandr.porunov@gmail.com> wrote:

> Hello,
>
> I have configured a Hadoop HA cluster. It works as in the tutorials: if I
> kill the NameNode process with "kill -9 NameNodeProcessId", my standby node
> changes its state to active. But if I power off the active node, the standby
> node can't change its state to active, because it tries to connect to the
> crashed node over SSH.
>
> This parameter doesn't work:
> <property>
>         <name>dfs.ha.fencing.ssh.connect-timeout</name>
>         <value>3000</value>
> </property>
>
> I read in the documentation that it is 5 seconds by default, but even after
> 5 minutes the standby node keeps trying to connect to the crashed node. I
> set it manually to 3 seconds, but it still doesn't work. So if we just kill
> the namenode process our cluster keeps working, but if we crash the active
> node our cluster becomes unavailable.
>
> *Here is part of the ZKFC log (after the crash, the logger writes the same
> information indefinitely)*:
> 2016-07-19 20:56:24,139 INFO org.apache.hadoop.ha.NodeFencer: ======
> Beginning Service Fencing Process... ======
> 2016-07-19 20:56:24,139 INFO org.apache.hadoop.ha.NodeFencer: Trying
> method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
> 2016-07-19 20:56:24,141 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
> Connecting to hadoopActiveMaster...
> 2016-07-19 20:56:24,141 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Connecting to hadoopActiveMaster port 22
> 2016-07-19 20:56:27,148 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
> Unable to connect to hadoopActiveMaster as user hadoop
> com.jcraft.jsch.JSchException: timeout: socket is not established
>         at com.jcraft.jsch.Util.createSocket(Util.java:386)
>         at com.jcraft.jsch.Session.connect(Session.java:182)
>         at
> org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
>         at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
>         at
> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:532)
>         at
> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
>         at
> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>         at
> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
>         at
> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
>         at
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
>         at
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2016-07-19 20:56:27,149 WARN org.apache.hadoop.ha.NodeFencer: Fencing
> method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
> 2016-07-19 20:56:27,149 ERROR org.apache.hadoop.ha.NodeFencer: Unable to
> fence service by any configured method.
> 2016-07-19 20:56:27,150 WARN org.apache.hadoop.ha.ActiveStandbyElector:
> Exception handling the winning of election
> java.lang.RuntimeException: Unable to fence NameNode at hadoopActiveMaster/
> 192.168.0.80:8020
>         at
> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
>         at
> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
>         at
> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>         at
> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
>         at
> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
>         at
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
>         at
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2016-07-19 20:56:27,150 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Trying to re-establish ZK session
> 2016-07-19 20:56:27,177 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x3560443c6e30003 closed
> 2016-07-19 20:56:28,183 INFO org.apache.zookeeper.ZooKeeper: Initiating
> client connection,
> connectString=hadoopActiveMaster:2181,hadoopStandby:2181,hadoopSlave1:2181
> sessionTimeout=5000
> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@d49b070
> 2016-07-19 20:56:28,186 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server hadoopStandby/192.168.0.81:2181. Will not
> attempt to authenticate using SASL (unknown error)
> 2016-07-19 20:56:28,187 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to hadoopStandby/192.168.0.81:2181, initiating
> session
> 2016-07-19 20:56:28,197 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server hadoopStandby/192.168.0.81:2181,
> sessionid = 0x2560443c4670003, negotiated timeout = 5000
> 2016-07-19 20:56:28,199 INFO org.apache.zookeeper.ClientCnxn: EventThread
> shut down
> 2016-07-19 20:56:28,203 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Session connected.
> 2016-07-19 20:56:28,207 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Checking for any old active which needs to be fenced...
> 2016-07-19 20:56:28,210 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Old node exists:
> 0a096d79636c757374657212036d6e311a126861646f6f704163746976654d617374657220d43e28d33e
> 2016-07-19 20:56:28,213 INFO org.apache.hadoop.ha.ZKFailoverController:
> Should fence: NameNode at hadoopActiveMaster/192.168.0.80:8020
> 2016-07-19 20:56:48,232 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: hadoopActiveMaster/192.168.0.80:8020. Already tried 0
> time(s); maxRetries=1
> 2016-07-19 20:57:08,242 WARN org.apache.hadoop.ha.FailoverController:
> Unable to gracefully make NameNode at hadoopActiveMaster/192.168.0.80:8020
> standby (unable to connect)
> org.apache.hadoop.net.ConnectTimeoutException: Call From hadoopStandby/
> 192.168.0.81 to hadoopActiveMaster:8020 failed on socket timeout
> exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
> timeout while waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending
> remote=hadoopActiveMaster/192.168.0.80:8020]; For more details see:
> http://wiki.apache.org/hadoop/SocketTimeout
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at
> org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>         at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
>         at
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
>         at
> org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
>         at
> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514)
>         at
> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
>         at
> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>         at
> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
>         at
> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
>         at
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
>         at
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
> timeout while waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending
> remote=hadoopActiveMaster/192.168.0.80:8020]
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
>         at
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
>         at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
>         at
> org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1451)
>         ... 14 more
> 2016-07-19 20:57:08,243 INFO org.apache.hadoop.ha.NodeFencer: ======
> Beginning Service Fencing Process... ======
>
> *Here is my hdfs-site.xml*:
> <configuration>
>     <property>
>         <name>dfs.nameservices</name>
>         <value>mycluster</value>
>         <final>true</final>
>     </property>
>     <property>
>         <name>dfs.ha.namenodes.mycluster</name>
>         <value>mn1,mn2</value>
>         <final>true</final>
>     </property>
>     <property>
>         <name>dfs.namenode.rpc-address.mycluster.mn1</name>
>         <value>hadoopActiveMaster:8020</value>
>     </property>
>     <property>
>         <name>dfs.namenode.rpc-address.mycluster.mn2</name>
>         <value>hadoopStandby:8020</value>
>     </property>
>     <property>
>         <name>dfs.namenode.http-address.mycluster.mn1</name>
>         <value>hadoopActiveMaster:50070</value>
>     </property>
>     <property>
>         <name>dfs.namenode.http-address.mycluster.mn2</name>
>         <value>hadoopStandby:50070</value>
>     </property>
>     <property>
>         <name>ha.zookeeper.quorum</name>
>
> <value>hadoopActiveMaster:2181,hadoopStandby:2181,hadoopSlave1:2181</value>
>     </property>
>     <property>
>         <name>dfs.namenode.shared.edits.dir</name>
>
> <value>qjournal://hadoopActiveMaster:8485;hadoopStandby:8485;hadoopSlave1:8485/mycluster</value>
>     </property>
>     <property>
>         <name>dfs.ha.automatic-failover.enabled</name>
>         <value>true</value>
>     </property>
>     <property>
>         <name>dfs.replication</name>
>         <value>3</value>
>     </property>
>     <property>
>         <name>dfs.ha.fencing.methods</name>
>         <value>sshfence</value>
>     </property>
>     <property>
>         <name>dfs.ha.fencing.ssh.private-key-files</name>
>         <value>/usr/hadoop/.ssh/id_rsa</value>
>     </property>
>     <property>
>         <name>dfs.ha.fencing.ssh.connect-timeout</name>
>         <value>3000</value>
>     </property>
> </configuration>
>
> *Here is my core-site.xml*:
> <configuration>
>   <property>
>     <name>fs.defaultFS</name>
>     <value>hdfs://mycluster</value>
>   </property>
>   <property>
>     <name>dfs.journalnode.edits.dir</name>
>     <value>/var/hadoop/jn</value>
>   </property>
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/usr/hadoop/tmp</value>
>   </property>
> </configuration>
>
> *Here is my zoo.cfg:*
> tickTime=2000
> initLimit=10
> syncLimit=5
> dataDir=/var/zookeeper/data
> dataLogDir=/var/zookeeper/logs
> clientPort=2181
>
> server.1=hadoopActiveMaster:2888:3888
> server.2=hadoopStandby:2888:3888
> server.3=hadoopSlave1:2888:3888
>
> *Here is my /etc/hosts:*
> 127.0.0.1   me
> 192.168.0.80 hadoopActiveMaster
> 192.168.0.81 hadoopStandby
> 192.168.0.82 hadoopSlave1
> 192.168.0.83 hadoopSlave2
>
> Please help me to solve this problem.
>
> Sincerely,
> Alexandr
>
