From: Michael Segel
Subject: Re: HBase dies after some time
Date: Mon, 11 Jun 2012 07:11:04 -0500
To: user@hbase.apache.org
CC: zookeeper-user@hadoop.apache.org

Hi,

Sorry for the late post to this thread...

To add to what Harsh has to say...

YOU NEVER, NEVER EVER RUN ZK ON THE SAME NODE AS YOUR TTs.

Sorry for shouting, but that's a core design rule that shouldn't be broken at any cost.

You would be more stable running one ZK on a control node than you would be running them on the TT/DN nodes.

While a little swap won't kill a Hadoop cluster running just M/R, add HBase and swapping becomes fatal. This is the core problem with Christian's machine.

Because you can run Hadoop on everything from a VM or a single machine to a cluster of 1000+ machines, hardware design is often overlooked, and with each major hardware vendor creating their own reference architecture, it gets confusing and you may end up spending $$$ on resources you can't fully take advantage of.

On May 30, 2012, at 2:33 AM, Harsh J wrote:

> You may colocate your ZK with the HBase Master as it's not very heavy.
> Depending on your cluster size, 1-3 may be enough and you can divide
> it among HBM, SNN and perhaps NN/JT machines.
>
> On Wed, May 30, 2012 at 2:54 AM, Something Something wrote:
>> Hmm.. due to budget constraints, I am forced to install ZooKeeper on the
>> same machine that runs TaskTracker. When a big MR job starts it fires up
>> over 40 tasks, so as you implied this could definitely be related to memory.
>>
>> Should ZooKeepers be started on their own machines? Right now I have
>> ZooKeeper, HRegionServer & TaskTracker running on the same machine. This
>> is a bad idea, right? Is there any way to get ZooKeeper working under
>> these restrictions?
>>
>> By the way, the ZooKeeper log shows this:
>>
>> 2012-05-29 13:56:54,842 - ERROR [CommitProcessor:2:NIOServerCnxn@445] - Unexpected Exception:
>> java.nio.channels.CancelledKeyException
>>     at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
>>     at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
>>     at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
>>     at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
>>     at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:367)
>>     at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
>>
>> On Sat, May 26, 2012 at 2:28 AM, Christian Schäfer wrote:
>>
>>> Hi,
>>>
>>> I got exactly the same behaviour and exceptions that you mention on a
>>> local cluster.
>>>
>>> In my case the sum of all services' heap space was higher than the actual
>>> memory of the machine.
>>> First, sum the heap sizes of the services on your master machine, likely
>>> NameNode, HMaster, ZooKeeper, and maybe also RegionServer and DataNode.
>>> Then check that this sum is less than your master machine's memory.
>>>
>>> Good luck.
>>> Chris
>>>
>>> From: Something Something
>>> To: hbase-user@hadoop.apache.org; zookeeper-user@hadoop.apache.org
>>> Sent: 3:22 Saturday, 26 May 2012
>>> Subject: HBase dies after some time
>>>
>>> Hello,
>>>
>>> I recently installed ZooKeeper & HBase on our dedicated Hadoop cluster on
>>> EC2. HBase stays active for some time, but after a while it dies with
>>> error messages similar to these:
>>>
>>> 2012-05-25 12:09:27,514 ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:60000-0x5378489312c0004-0x5378489312c0004 Received unexpected KeeperException, re-throwing exception
>>> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>>>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
>>>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAsAddress(ZKUtil.java:620)
>>>     at org.apache.hadoop.hbase.master.ActiveMasterManager.stop(ActiveMasterManager.java:197)
>>>     at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:310)
>>> 2012-05-25 12:09:27,514 ERROR org.apache.hadoop.hbase.master.ActiveMasterManager: master:60000-0x5378489312c0004-0x5378489312c0004 Error deleting our own master address node
>>> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>>>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
>>>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAsAddress(ZKUtil.java:620)
>>>     at org.apache.hadoop.hbase.master.ActiveMasterManager.stop(ActiveMasterManager.java:197)
>>>     at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:310)
>>>
>>> This kills the HMaster as well as all HRegionServers. Could it be that my
>>> ZooKeeper setup is incorrect? Please help. Thanks.
>>>
>
> --
> Harsh J
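
Editor's sketch: Christian's check earlier in the thread (sum the heap sizes of every service on a node and compare the total against physical memory) can be written down as a few lines of Python. Everything numeric here is a hypothetical assumption for illustration, not a figure from the thread apart from the "over 40 tasks" count: the per-service heap sizes, the 200 MB per child task, the 8 GB machine, and the 1 GB reserved for the OS. The helper name heap_budget is also made up. Note that configured heap is only a lower bound on what a JVM actually uses, so a small positive headroom is not a guarantee against swapping.

```python
"""Sketch: check that the configured JVM heaps on one node fit in physical RAM."""

from typing import Dict


def heap_budget(heaps_mb: Dict[str, int], physical_mb: int,
                os_reserve_mb: int = 1024) -> int:
    """Return the headroom in MB; a negative value means the node is overcommitted."""
    return physical_mb - os_reserve_mb - sum(heaps_mb.values())


# Hypothetical -Xmx values for a worker node that hosts everything at once.
# Substitute the settings from your own hadoop-env.sh / hbase-env.sh.
worker = {
    "DataNode": 1024,
    "TaskTracker": 1024,
    "HRegionServer": 4096,
    "ZooKeeper": 1024,
    "MapReduce child tasks (40 x 200 MB)": 40 * 200,
}

# Assumed 8 GB machine.
headroom = heap_budget(worker, physical_mb=8 * 1024)
print(f"headroom: {headroom} MB")
if headroom < 0:
    print("Overcommitted: expect swapping once all tasks are running.")
```

With these assumed numbers the child tasks alone ask for about as much memory as the whole machine has, which is the situation the replies above warn about: once the node swaps, ZooKeeper sessions can time out and HBase goes down with them.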