Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@zookeeper.apache.org
Message-ID: <555F0FE9.3080009@lissyara.su>
Date: Fri, 22 May 2015 14:15:53 +0300
From: "skeletor@lissyara.su" <skeletor@lissyara.su>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.7.0
MIME-Version: 1.0
To: user@zookeeper.apache.org
Subject: Re: zookeeper for namenode doesn't elect active, when host is down
References: <555DDDDD.5030501@lissyara.su>
 <D18361EE.22D6E%cnauroth@hortonworks.com>
In-Reply-To: <D18361EE.22D6E%cnauroth@hortonworks.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

Thanks for the reply, Chris. Now I understand. Look, I have 2 NameNodes 
(the maximum in an HA cluster), and a ZKFC is started on each of them. So 
when the host with one of them halts, ONLY one ZKFC is still running, and 
it cannot elect a leader. When I try to run a ZKFC on the DataNodes or the 
ResMan host, I get an error:


Exception in thread "main" org.apache.hadoop.HadoopIllegalArgumentException: Could not get the namenode ID of this node. You may run zkfc on the node other than namenode.
         at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.create(DFSZKFailoverController.java:128)
         at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:177)


How can I start a ZKFC on a node other than a NameNode?
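
For reference, the lock-based election that Chris describes in the quoted reply below can be sketched as a toy model. This is only an illustration under stated assumptions: `ToyLockService`, `elect`, and the session names `zkfc-nn1` and `zkfc-nn2` are hypothetical, the lock is an in-memory stand-in for the ephemeral `ActiveStandbyElectorLock` znode, and fencing and health checks are left out entirely.

```python
class ToyLockService:
    """In-memory stand-in for an ephemeral lock znode (not a ZooKeeper client)."""

    def __init__(self):
        self.holder = None  # session that currently owns the lock, if any

    def try_acquire(self, session):
        # Models "create ephemeral node": succeeds only if no node exists yet.
        if self.holder is None:
            self.holder = session
            return True
        return False  # comparable to the NodeExists error seen in the log

    def expire(self, session):
        # When a session expires, its ephemeral node disappears.
        if self.holder == session:
            self.holder = None


def elect(lock, sessions):
    """Every live controller tries the lock; the holder is active, the rest standby."""
    states = {}
    for session in sessions:
        if lock.holder == session or lock.try_acquire(session):
            states[session] = "active"
        else:
            states[session] = "standby"
    return states


lock = ToyLockService()
before = elect(lock, ["zkfc-nn1", "zkfc-nn2"])
lock.expire("zkfc-nn1")  # the host of nn1 halts and its session times out
after = elect(lock, ["zkfc-nn2"])
print(before)
print(after)
```

In this model the election is decided by who holds the lock, not by a vote among controllers, so a single surviving controller can take the lock once the other session has expired. Whether a real cluster behaves this way also depends on parts the sketch omits, such as fencing.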

On 21.05.2015 20:28, Chris Nauroth wrote:
> Hello,
>
> The HA implementations for NameNode and ResourceManager are slightly
> different.  For the NameNode, there is a separate process called the
> ZKFailoverController that owns the ZooKeeper session.  When that process
> sees that it has obtained a lock through ZooKeeper, then it sends a
> command to the NameNode on the same host to transition to active state.
> For the ResourceManager, there is no separate failover controller process.
>   Instead, the ResourceManager process directly runs the ZooKeeper client,
> owns the ZooKeeper session, and handles its own failover semantics.
>
> The symptoms that you described make it sound like perhaps one of the
> ZKFailoverController processes is not running or is malfunctioning.  I
> recommend starting the investigation there.  Full documentation of this
> architecture and its configuration is available here:
>
> http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
>
>
> This is more of an HDFS question than a ZooKeeper question, so for any
> follow-up discussion, I recommend restarting the thread on
> user@hadoop.apache.org.
>
> I hope this helps!
>
> --Chris Nauroth
>
>
>
>
> On 5/21/15, 6:30 AM, "skeletor@lissyara.su" <skeletor@lissyara.su> wrote:
>
>> Hello.
>> I have set up a Hadoop HA cluster with automatic failover for the NameNodes
>> and the ResourceManager by following these manuals:
>>
>> http://www.oracle.com/technetwork/articles/servers-storage-admin/hadoop-cluster-solaris-2203962.html#16
>> http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html
>>
>>
>> So, when I halt only a Hadoop daemon, ZooKeeper switches the active
>> NameNode and ResMan. But when I halt a whole server (one that is also a
>> member of the ZooKeeper quorum), only the ResMan switches.
>> I have tried many configurations.
>>
>> Here is zoo.cfg:
>>
>> tickTime=2000
>> initLimit=5
>> syncLimit=2
>> dataDir=/var/zookeeper/data
>> clientPort=2181
>> cnxTimeout=3
>>
>> server.1=name-node1:2888:3888
>> server.2=name-node2:2888:3888
>> server.3=resource-manager:2888:3888
>> server.4=resource-manager2:2888:3888
>> server.5=data-node1:2888:3888
>> server.6=data-node2:2888:3888
>>
>> group.1=1:2:5
>> group.2=3:4:6
>>
>> core-site.xml
>>
>>    <property>
>>      <name>ha.zookeeper.quorum</name>
>>      <value>name-node1:2181,name-node2:2181,data-node1:2181</value>
>>    </property>
>>
>> yarn-site.xml
>>
>>    <property>
>>      <name>yarn.resourcemanager.zk-address</name>
>>
>>      <value>resource-manager:2181,resource-manager2:2181,data-node2:2181</value>
>>    </property>
>>
>> When I halted the whole host name-node1, I saw the following in ZooKeeper's log:
>>
>> 2015-05-21 13:24:22,177 [myid:5] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@780] - Connection broken for id 3, my id = 5, error =
>> java.io.EOFException
>>          at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>          at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:765)
>> 2015-05-21 13:24:22,178 [myid:5] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@783] - Interrupting SendWorker
>> 2015-05-21 13:24:22,179 [myid:5] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@697] - Interrupted while waiting for message on queue
>> java.lang.InterruptedException
>>          at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>>          at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>>          at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>>          at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:849)
>>          at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:64)
>>          at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:685)
>> 2015-05-21 13:24:22,179 [myid:5] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@706] - Send worker leaving thread
>>
>> When I halted the whole host resource-manager, I saw the following in ZooKeeper's log:
>>
>>
>> 2015-05-21 13:24:22,990 [myid:4] - INFO  [ProcessThread(sid:4 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x34d767b51ef0000 type:create cxid:0x9 zxid:0x1c0000004e txntype:-1 reqpath:n/a Error Path:/yarn-leader-election/dph-rm/ActiveStandbyElectorLock Error:KeeperErrorCode = NodeExists for /yarn-leader-election/dph-rm/ActiveStandbyElectorLock
>>
>> After this, ResMan2 became active.
>>
>> What am I doing wrong?
>> Thanks.
>
>
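
The `group.N` lines in the quoted zoo.cfg enable ZooKeeper's hierarchical quorums. As a rough way to reason about which host failures that layout tolerates, here is a minimal sketch. It assumes every server has the default weight of 1 (the config has no `weight.x` lines) and that a quorum needs a majority of groups, each holding a majority of its own weight. The function name `has_quorum` and its arguments are hypothetical; this is not ZooKeeper code.

```python
def has_quorum(groups, live, weights=None):
    """Return True if the live servers satisfy a hierarchical-quorum rule.

    groups:  mapping of group id -> list of server ids (the group.N lines)
    live:    set of server ids that are currently reachable
    weights: optional mapping of server id -> weight (assumed 1 if absent)
    """
    weights = weights or {}
    groups_with_majority = 0
    for members in groups.values():
        total = sum(weights.get(m, 1) for m in members)
        alive = sum(weights.get(m, 1) for m in members if m in live)
        if alive * 2 > total:  # strictly more than half of the group's weight
            groups_with_majority += 1
    # Strictly more than half of the groups must each have a majority.
    return groups_with_majority * 2 > len(groups)


# Layout taken from the quoted zoo.cfg: group.1=1:2:5 and group.2=3:4:6
groups = {1: [1, 2, 5], 2: [3, 4, 6]}
all_servers = {1, 2, 3, 4, 5, 6}

print(has_quorum(groups, all_servers - {1}))     # name-node1 (server.1) down
print(has_quorum(groups, all_servers - {3}))     # resource-manager (server.3) down
print(has_quorum(groups, all_servers - {1, 2}))  # both NameNode hosts down
```

Under these assumptions, two groups mean both must keep a majority, so the ensemble survives the loss of any single host but not the loss of two servers from the same group.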