hadoop-hdfs-user mailing list archives

From Alexandr Porunov <alexandr.poru...@gmail.com>
Subject Re: ZKFC fencing problem after the active node crash
Date Wed, 20 Jul 2016 14:31:52 GMT
Hi Rakesh,

Thank you very much for your help! It helped. Now, after the active node
crashes, the standby node becomes active.

Best regards,
Alexandr

On Wed, Jul 20, 2016 at 6:02 AM, Rakesh Radhakrishnan <rakeshr@apache.org>
wrote:

> Hi Alexandr,
>
> Since you powered off the active NameNode machine, during failover the
> standby NameNode timed out trying to connect to it and fencing failed.
> Fencing methods are typically configured to prevent multiple writers to
> the same shared storage. It looks like you are using QJM, which provides
> this fencing guarantee on its own, i.e. it won't allow multiple writers at
> a time, so I think the external fencing method can be skipped. AFAIK, to
> keep the system available in the event the fencing mechanism itself fails,
> it is advisable to configure a fencing method that is guaranteed to return
> success. You can remove the SSH fencing method from both machines'
> configurations. Please try the shell-based fence method below (which skips
> SSH fencing) and restart the cluster; failover should then happen
> successfully.
>
> <property>
>   <name>dfs.ha.fencing.methods</name>
>   <value>shell(/bin/true)</value>
> </property>
>
> *Reference:-*
> https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
> "*JournalNodes will only ever allow a single NameNode to be a writer at a
> time. During a failover, the NameNode which is to become active will simply
> take over the role of writing to the JournalNodes, which will effectively
> prevent the other NameNode from continuing in the Active state, allowing
> the new Active to safely proceed with failover*."
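>
> If you prefer to keep sshfence as the first choice, note that
> dfs.ha.fencing.methods accepts a newline-separated list of methods which
> are attempted in order until one succeeds, so shell(/bin/true) can act as
> a last-resort fallback. A minimal sketch (adjust to your setup):
>
> <property>
>   <name>dfs.ha.fencing.methods</name>
>   <value>sshfence
>          shell(/bin/true)</value>
> </property>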
>
> Regards,
> Rakesh
>
> On Wed, Jul 20, 2016 at 12:52 AM, Alexandr Porunov <
> alexandr.porunov@gmail.com> wrote:
>
>> Hello,
>>
>> I have configured a Hadoop HA cluster. It works as described in the
>> tutorials: if I kill the NameNode process with "kill -9 NameNodeProcessId",
>> my standby node changes its state to active. But if I power off the active
>> node, the standby node can't change its state to active because it keeps
>> trying to connect to the crashed node over SSH.
>>
>> This parameter doesn't work:
>> <property>
>>         <name>dfs.ha.fencing.ssh.connect-timeout</name>
>>         <value>3000</value>
>> </property>
>>
>> I read in the documentation that it is 5 seconds by default, but even
>> after 5 minutes the standby node keeps trying to connect to the crashed
>> node. I set it manually to 3 seconds but it still doesn't work. So if we
>> just kill the NameNode process the cluster keeps working, but if we crash
>> the active node the cluster becomes unavailable.
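>>
>> (The state of each NameNode can be checked with hdfs haadmin; mn1 and mn2
>> are the NameNode IDs from my hdfs-site.xml below:)
>>
>> hdfs haadmin -getServiceState mn1
>> hdfs haadmin -getServiceState mn2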
>>
>> *Here is part of the ZKFC log (after the crash it keeps writing the same
>> messages over and over)*:
>> 2016-07-19 20:56:24,139 INFO org.apache.hadoop.ha.NodeFencer: ======
>> Beginning Service Fencing Process... ======
>> 2016-07-19 20:56:24,139 INFO org.apache.hadoop.ha.NodeFencer: Trying
>> method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
>> 2016-07-19 20:56:24,141 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
>> Connecting to hadoopActiveMaster...
>> 2016-07-19 20:56:24,141 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Connecting to hadoopActiveMaster port 22
>> 2016-07-19 20:56:27,148 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
>> Unable to connect to hadoopActiveMaster as user hadoop
>> com.jcraft.jsch.JSchException: timeout: socket is not established
>>         at com.jcraft.jsch.Util.createSocket(Util.java:386)
>>         at com.jcraft.jsch.Session.connect(Session.java:182)
>>         at
>> org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
>>         at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:532)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
>> 2016-07-19 20:56:27,149 WARN org.apache.hadoop.ha.NodeFencer: Fencing
>> method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
>> 2016-07-19 20:56:27,149 ERROR org.apache.hadoop.ha.NodeFencer: Unable to
>> fence service by any configured method.
>> 2016-07-19 20:56:27,150 WARN org.apache.hadoop.ha.ActiveStandbyElector:
>> Exception handling the winning of election
>> java.lang.RuntimeException: Unable to fence NameNode at
>> hadoopActiveMaster/192.168.0.80:8020
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
>> 2016-07-19 20:56:27,150 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Trying to re-establish ZK session
>> 2016-07-19 20:56:27,177 INFO org.apache.zookeeper.ZooKeeper: Session:
>> 0x3560443c6e30003 closed
>> 2016-07-19 20:56:28,183 INFO org.apache.zookeeper.ZooKeeper: Initiating
>> client connection,
>> connectString=hadoopActiveMaster:2181,hadoopStandby:2181,hadoopSlave1:2181
>> sessionTimeout=5000
>> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@d49b070
>> 2016-07-19 20:56:28,186 INFO org.apache.zookeeper.ClientCnxn: Opening
>> socket connection to server hadoopStandby/192.168.0.81:2181. Will not
>> attempt to authenticate using SASL (unknown error)
>> 2016-07-19 20:56:28,187 INFO org.apache.zookeeper.ClientCnxn: Socket
>> connection established to hadoopStandby/192.168.0.81:2181, initiating
>> session
>> 2016-07-19 20:56:28,197 INFO org.apache.zookeeper.ClientCnxn: Session
>> establishment complete on server hadoopStandby/192.168.0.81:2181,
>> sessionid = 0x2560443c4670003, negotiated timeout = 5000
>> 2016-07-19 20:56:28,199 INFO org.apache.zookeeper.ClientCnxn: EventThread
>> shut down
>> 2016-07-19 20:56:28,203 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Session connected.
>> 2016-07-19 20:56:28,207 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Checking for any old active which needs to be fenced...
>> 2016-07-19 20:56:28,210 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Old node exists:
>> 0a096d79636c757374657212036d6e311a126861646f6f704163746976654d617374657220d43e28d33e
>> 2016-07-19 20:56:28,213 INFO org.apache.hadoop.ha.ZKFailoverController:
>> Should fence: NameNode at hadoopActiveMaster/192.168.0.80:8020
>> 2016-07-19 20:56:48,232 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect to server: hadoopActiveMaster/192.168.0.80:8020. Already tried 0
>> time(s); maxRetries=1
>> 2016-07-19 20:57:08,242 WARN org.apache.hadoop.ha.FailoverController:
>> Unable to gracefully make NameNode at hadoopActiveMaster/
>> 192.168.0.80:8020 standby (unable to connect)
>> org.apache.hadoop.net.ConnectTimeoutException: Call From hadoopStandby/
>> 192.168.0.81 to hadoopActiveMaster:8020 failed on socket timeout
>> exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
>> timeout while waiting for channel to be ready for connect. ch :
>> java.nio.channels.SocketChannel[connection-pending
>> remote=hadoopActiveMaster/192.168.0.80:8020]; For more details see:
>> http://wiki.apache.org/hadoop/SocketTimeout
>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>>         at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>         at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>         at
>> org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>>         at
>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>>         at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
>>         at
>> org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
>>         at
>> org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
>> Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
>> timeout while waiting for channel to be ready for connect. ch :
>> java.nio.channels.SocketChannel[connection-pending
>> remote=hadoopActiveMaster/192.168.0.80:8020]
>>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
>>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
>>         at
>> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
>>         at
>> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
>>         at
>> org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
>>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1451)
>>         ... 14 more
>> 2016-07-19 20:57:08,243 INFO org.apache.hadoop.ha.NodeFencer: ======
>> Beginning Service Fencing Process... ======
>>
>> *Here is my hdfs-site.xml*:
>> <configuration>
>>     <property>
>>         <name>dfs.nameservices</name>
>>         <value>mycluster</value>
>>         <final>true</final>
>>     </property>
>>     <property>
>>         <name>dfs.ha.namenodes.mycluster</name>
>>         <value>mn1,mn2</value>
>>         <final>true</final>
>>     </property>
>>     <property>
>>         <name>dfs.namenode.rpc-address.mycluster.mn1</name>
>>         <value>hadoopActiveMaster:8020</value>
>>     </property>
>>     <property>
>>         <name>dfs.namenode.rpc-address.mycluster.mn2</name>
>>         <value>hadoopStandby:8020</value>
>>     </property>
>>     <property>
>>         <name>dfs.namenode.http-address.mycluster.mn1</name>
>>         <value>hadoopActiveMaster:50070</value>
>>     </property>
>>     <property>
>>         <name>dfs.namenode.http-address.mycluster.mn2</name>
>>         <value>hadoopStandby:50070</value>
>>     </property>
>>     <property>
>>         <name>ha.zookeeper.quorum</name>
>>
>> <value>hadoopActiveMaster:2181,hadoopStandby:2181,hadoopSlave1:2181</value>
>>     </property>
>>     <property>
>>         <name>dfs.namenode.shared.edits.dir</name>
>>
>> <value>qjournal://hadoopActiveMaster:8485;hadoopStandby:8485;hadoopSlave1:8485/mycluster</value>
>>     </property>
>>     <property>
>>         <name>dfs.ha.automatic-failover.enabled</name>
>>         <value>true</value>
>>     </property>
>>     <property>
>>         <name>dfs.replication</name>
>>         <value>3</value>
>>     </property>
>>     <property>
>>         <name>dfs.ha.fencing.methods</name>
>>         <value>sshfence</value>
>>     </property>
>>     <property>
>>         <name>dfs.ha.fencing.ssh.private-key-files</name>
>>         <value>/usr/hadoop/.ssh/id_rsa</value>
>>     </property>
>>     <property>
>>         <name>dfs.ha.fencing.ssh.connect-timeout</name>
>>         <value>3000</value>
>>     </property>
>> </configuration>
>>
>> *Here is my core-site.xml*:
>> <configuration>
>>   <property>
>>     <name>fs.defaultFS</name>
>>     <value>hdfs://mycluster</value>
>>   </property>
>>   <property>
>>     <name>dfs.journalnode.edits.dir</name>
>>     <value>/var/hadoop/jn</value>
>>   </property>
>>   <property>
>>     <name>hadoop.tmp.dir</name>
>>     <value>/usr/hadoop/tmp</value>
>>   </property>
>> </configuration>
>>
>> *Here is my zoo.cfg:*
>> tickTime=2000
>> initLimit=10
>> syncLimit=5
>> dataDir=/var/zookeeper/data
>> dataLogDir=/var/zookeeper/logs
>> clientPort=2181
>>
>> server.1=hadoopActiveMaster:2888:3888
>> server.2=hadoopStandby:2888:3888
>> server.3=hadoopSlave1:2888:3888
>>
>> *Here is my /etc/hosts:*
>> 127.0.0.1   me
>> 192.168.0.80 hadoopActiveMaster
>> 192.168.0.81 hadoopStandby
>> 192.168.0.82 hadoopSlave1
>> 192.168.0.83 hadoopSlave2
>>
>> Please help me to solve this problem.
>>
>> Sincerely,
>> Alexandr
>>
>
>
