zookeeper-user mailing list archives

From Chris Nauroth <cnaur...@hortonworks.com>
Subject Re: zookeeper for namenode doesn't elect active, when host is down
Date Thu, 21 May 2015 17:28:55 GMT
Hello,

The HA implementations for NameNode and ResourceManager are slightly
different.  For the NameNode, there is a separate process called the
ZKFailoverController that owns the ZooKeeper session.  When that process
sees that it has obtained a lock through ZooKeeper, then it sends a
command to the NameNode on the same host to transition to active state.
For the ResourceManager, there is no separate failover controller process.
Instead, the ResourceManager process directly runs the ZooKeeper client,
owns the ZooKeeper session, and handles its own failover semantics.
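
As a rough illustration of the NameNode case, here is a toy, in-memory
simulation in Python. It does not use the real ZooKeeper or Hadoop APIs;
ToyCoordinator, ToyService, ToyFailoverController, and simulate are all
hypothetical names invented for this sketch. It only shows the idea that the
process owning the session must be running and must win the lock before its
service is told to become active.

```python
class ToyCoordinator:
    """In-memory stand-in for a ZooKeeper ensemble (hypothetical, not the real API)."""

    def __init__(self):
        self.lock_owner = None  # session that currently holds the lock

    def try_acquire(self, session):
        # Mimics creating an ephemeral lock node: only one session can hold it.
        if self.lock_owner is None:
            self.lock_owner = session
            return True
        return False

    def expire(self, session):
        # When a session expires, the lock it held goes away with it.
        if self.lock_owner == session:
            self.lock_owner = None


class ToyService:
    """Stand-in for a NameNode: it only changes state when told to."""

    def __init__(self, name):
        self.name = name
        self.state = "standby"


class ToyFailoverController:
    """Stand-in for a failover controller that owns the session for one service."""

    def __init__(self, session, service, running=True):
        self.session = session
        self.service = service
        self.running = running

    def tick(self, coordinator):
        # A controller that is not running never competes for the lock.
        if not self.running:
            return
        if coordinator.try_acquire(self.session):
            self.service.state = "active"


def simulate(standby_controller_running):
    """Halt host 1 and return the resulting state of the service on host 2."""
    zk = ToyCoordinator()
    nn1, nn2 = ToyService("nn1"), ToyService("nn2")
    fc1 = ToyFailoverController("session-1", nn1)
    fc2 = ToyFailoverController("session-2", nn2, running=standby_controller_running)

    fc1.tick(zk)  # host 1 wins the lock first and becomes active
    fc2.tick(zk)  # host 2 cannot take the lock while host 1 holds it

    # The whole of host 1 is halted: its service is gone and its session expires.
    nn1.state = "down"
    zk.expire("session-1")

    fc2.tick(zk)  # host 2 only takes over if its controller is running
    return nn2.state


if __name__ == "__main__":
    print("controller running:", simulate(True))
    print("controller not running:", simulate(False))
```

In this toy model, the service on the second host stays in standby whenever
its controller is not running, even though the lock has been released.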

The symptoms that you described make it sound like perhaps one of the
ZKFailoverController processes is not running or is malfunctioning.  I
recommend starting the investigation there.  Full documentation of this
architecture and its configuration is available here:

http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html


This is more of an HDFS question than a ZooKeeper question, so for any
follow-up discussion, I recommend restarting the thread on
user@hadoop.apache.org.

I hope this helps!

--Chris Nauroth




On 5/21/15, 6:30 AM, "skeletor@lissyara.su" <skeletor@lissyara.su> wrote:

>Hello.
>I have set up a Hadoop HA cluster with automatic failover for the NameNodes
>and the ResourceManager, following these manuals:
>
>http://www.oracle.com/technetwork/articles/servers-storage-admin/hadoop-cluster-solaris-2203962.html#16
>http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html
>
>
>So, when I halt only the Hadoop daemon, ZooKeeper switches the active
>NameNode and ResMan. But when I halt a whole server (one that is also a
>member of the ZooKeeper quorum), only ResMan switches.
>I have tried many configurations.
>
>Here is zoo.cfg:
>
>tickTime=2000
>initLimit=5
>syncLimit=2
>dataDir=/var/zookeeper/data
>clientPort=2181
>cnxTimeout=3
>
>server.1=name-node1:2888:3888
>server.2=name-node2:2888:3888
>server.3=resource-manager:2888:3888
>server.4=resource-manager2:2888:3888
>server.5=data-node1:2888:3888
>server.6=data-node2:2888:3888
>
>group.1=1:2:5
>group.2=3:4:6
>
>core-site.xml
>
>   <property>
>     <name>ha.zookeeper.quorum</name>
>     <value>name-node1:2181,name-node2:2181,data-node1:2181</value>
>   </property>
>
>yarn-site.xml
>
>   <property>
>     <name>yarn.resourcemanager.zk-address</name>
>     <value>resource-manager:2181,resource-manager2:2181,data-node2:2181</value>
>   </property>
>
>When I halted the whole host name-node1, I saw the following in the
>ZooKeeper log:
>
>2015-05-21 13:24:22,177 [myid:5] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@780] - Connection broken for id 3, my id = 5, error =
>java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:765)
>2015-05-21 13:24:22,178 [myid:5] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@783] - Interrupting SendWorker
>2015-05-21 13:24:22,179 [myid:5] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@697] - Interrupted while waiting for message on queue
>java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>         at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:849)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:64)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:685)
>2015-05-21 13:24:22,179 [myid:5] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@706] - Send worker leaving thread
>
>When I halted the whole host resource-manager, I saw the following in the
>ZooKeeper log:
>
>
>2015-05-21 13:24:22,990 [myid:4] - INFO  [ProcessThread(sid:4 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x34d767b51ef0000 type:create cxid:0x9 zxid:0x1c0000004e txntype:-1 reqpath:n/a Error Path:/yarn-leader-election/dph-rm/ActiveStandbyElectorLock Error:KeeperErrorCode = NodeExists for /yarn-leader-election/dph-rm/ActiveStandbyElectorLock
>
>After this, ResMan2 became active.
>
>What am I doing wrong?
>Thanks.
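
For reference, the group.1 and group.2 lines in the quoted zoo.cfg define a
hierarchical quorum. The Python sketch below is not ZooKeeper code; it is a
small illustration of the documented rule that a quorum needs a majority of
votes in a majority of groups. It assumes every server has weight 1, since the
quoted file lists no weight.N lines, and the names GROUPS and
has_hierarchical_quorum are made up for this example.

```python
# Server ids per group, copied from the quoted zoo.cfg:
#   group.1=1:2:5   group.2=3:4:6
GROUPS = {1: [1, 2, 5], 2: [3, 4, 6]}


def has_hierarchical_quorum(groups, alive, weights=None):
    """Return True if the live servers form a hierarchical quorum.

    groups  -- mapping of group id to a list of server ids
    alive   -- set of server ids that are currently up
    weights -- optional mapping of server id to weight (assumed 1 if absent)
    """
    weights = weights or {}
    groups_with_majority = 0
    for members in groups.values():
        total = sum(weights.get(sid, 1) for sid in members)
        live = sum(weights.get(sid, 1) for sid in members if sid in alive)
        # A group counts only if strictly more than half of its weight is up.
        if live * 2 > total:
            groups_with_majority += 1
    # Strictly more than half of the groups must each have a majority.
    return groups_with_majority * 2 > len(groups)


if __name__ == "__main__":
    everyone = {1, 2, 3, 4, 5, 6}
    # server.1 is name-node1 and server.3 is resource-manager in the quoted file.
    print("name-node1 down:", has_hierarchical_quorum(GROUPS, everyone - {1}))
    print("resource-manager down:", has_hierarchical_quorum(GROUPS, everyone - {3}))
    print("name-node1 and name-node2 down:",
          has_hierarchical_quorum(GROUPS, everyone - {1, 2}))
```

Under these assumptions, losing any single server leaves the quorum in place,
while losing two servers from the same group does not, because with only two
groups both of them must keep a majority.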

