From: "skeletor@lissyara.su"
Date: Fri, 22 May 2015 14:15:53 +0300
To: user@zookeeper.apache.org
Subject: Re: zookeeper for namenode doesn't elect active, when host is down

Thanks for the reply, Chris. Now I understand.

Look, I have 2 NameNodes (the maximum in an HA cluster), and ZKFC is started on them. So when the host with one NameNode halts, ONLY one ZKFC is running, and it cannot elect a leader. When I try to run a ZKFC on the DataNodes or the ResourceManager hosts, I get an error:

Exception in thread "main" org.apache.hadoop.HadoopIllegalArgumentException: Could not get the namenode ID of this node. You may run zkfc on the node other than namenode.
        at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.create(DFSZKFailoverController.java:128)
        at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:177)
(-u) 999990
virtual memory (kbytes, -v) unlimited

How can I start a ZKFC on a node other than a NameNode?

On 21.05.2015 20:28, Chris Nauroth wrote:
> Hello,
>
> The HA implementations for NameNode and ResourceManager are slightly
> different. For the NameNode, there is a separate process called the
> ZKFailoverController that owns the ZooKeeper session. When that process
> sees that it has obtained a lock through ZooKeeper, then it sends a
> command to the NameNode on the same host to transition to active state.
> For the ResourceManager, there is no separate failover controller process.
> Instead, the ResourceManager process directly runs the ZooKeeper client,
> owns the ZooKeeper session, and handles its own failover semantics.
>
> The symptoms that you described make it sound like perhaps one of the
> ZKFailoverController processes is not running or is malfunctioning. I
> recommend starting the investigation there. Full documentation of this
> architecture and its configuration is available here:
>
> http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
>
> This is more of an HDFS question than a ZooKeeper question, so for any
> follow-up discussion, I recommend restarting the thread on
> user@hadoop.apache.org.
>
> I hope this helps!
>
> --Chris Nauroth
>
> On 5/21/15, 6:30 AM, "skeletor@lissyara.su" wrote:
>
>> Hello.
>> I have set up a Hadoop HA cluster with automatic failover for the
>> NameNodes and the ResourceManager, following these manuals:
>>
>> http://www.oracle.com/technetwork/articles/servers-storage-admin/hadoop-cluster-solaris-2203962.html#16
>> http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html
>>
>> When I halt only the Hadoop daemon, ZooKeeper switches the active
>> NameNode and ResMan. But when I halt the whole server (which is also a
>> member of the ZooKeeper quorum), only ResMan switches.
>> I have tried many configurations.
>>
>> Here is zoo.cfg:
>>
>> tickTime=2000
>> initLimit=5
>> syncLimit=2
>> dataDir=/var/zookeeper/data
>> clientPort=2181
>> cnxTimeout=3
>>
>> server.1=name-node1:2888:3888
>> server.2=name-node2:2888:3888
>> server.3=resource-manager:2888:3888
>> server.4=resource-manager2:2888:3888
>> server.5=data-node1:2888:3888
>> server.6=data-node2:2888:3888
>>
>> group.1=1:2:5
>> group.2=3:4:6
>>
>> core-site.xml:
>>
>> <property>
>>   <name>ha.zookeeper.quorum</name>
>>   <value>name-node1:2181,name-node2:2181,data-node1:2181</value>
>> </property>
>>
>> yarn-site.xml:
>>
>> <property>
>>   <name>yarn.resourcemanager.zk-address</name>
>>   <value>resource-manager:2181,resource-manager2:2181,data-node2:2181</value>
>> </property>
>>
>> When I halted the whole host name-node1, I saw the following in
>> ZooKeeper's log:
>>
>> 2015-05-21 13:24:22,177 [myid:5] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@780] - Connection broken for id 3, my id = 5, error =
>> java.io.EOFException
>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>         at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:765)
>> 2015-05-21 13:24:22,178 [myid:5] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@783] - Interrupting SendWorker
>> 2015-05-21 13:24:22,179 [myid:5] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@697] - Interrupted while waiting for message on queue
>> java.lang.InterruptedException
>>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>>         at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:849)
>>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:64)
>>         at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:685)
>> 2015-05-21 13:24:22,179 [myid:5] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@706] - Send worker leaving thread
>>
>> When I halted the whole host resource-manager, I saw the following in
>> ZooKeeper's log:
>>
>> 2015-05-21 13:24:22,990 [myid:4] - INFO [ProcessThread(sid:4 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x34d767b51ef0000 type:create cxid:0x9 zxid:0x1c0000004e txntype:-1 reqpath:n/a Error Path:/yarn-leader-election/dph-rm/ActiveStandbyElectorLock Error:KeeperErrorCode = NodeExists for /yarn-leader-election/dph-rm/ActiveStandbyElectorLock
>>
>> After this, ResMan2 became active.
>>
>> What am I doing wrong?
>> Thanks.
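
For reference, here is a minimal sketch of how the group.1/group.2 lines in the zoo.cfg quoted above affect quorum. It assumes ZooKeeper's hierarchical quorum rule as I understand it from the Administrator's Guide: a quorum needs a majority of votes from a majority of groups, with every server at the default weight of 1. The function name has_quorum is hypothetical, it is not a ZooKeeper API, and this is not meant as a diagnosis of the failover problem itself.

```python
# Group layout copied from the quoted zoo.cfg (group.1=1:2:5, group.2=3:4:6).
GROUPS = {1: {1, 2, 5}, 2: {3, 4, 6}}


def has_quorum(alive, groups=GROUPS):
    """Return True if the alive server ids satisfy the assumed hierarchical rule."""
    alive = set(alive)
    # A group is satisfied only if more than half of its members are alive
    # (assumption: every server has the default weight of 1).
    satisfied = sum(
        1 for members in groups.values()
        if len(members & alive) * 2 > len(members)
    )
    # The ensemble needs more than half of the groups to be satisfied.
    return satisfied * 2 > len(groups)


if __name__ == "__main__":
    all_servers = {1, 2, 3, 4, 5, 6}
    print(has_quorum(all_servers - {1}))     # name-node1 down -> True
    print(has_quorum(all_servers - {1, 2}))  # both NameNode hosts down -> False
```

Under these assumptions, with only two groups both of them must keep a majority, so the ensemble survives one failed server per group but not two failures inside the same group.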