From: Ted Yu
Subject: Re: max regionserver handler count
Date: Mon, 29 Apr 2013 02:25:13 -0700
To: "user@hbase.apache.org"

I noticed the 8 occurrences of 0x703e... following the region server name in
the abort message.
I wonder why the repetition?

Cheers

On Apr 29, 2013, at 2:17 AM, Viral Bajaria wrote:

> On Sun, Apr 28, 2013 at 7:37 PM, ramkrishna vasudevan <
> ramkrishna.s.vasudevan@gmail.com> wrote:
>
>> So you mean that this happens when the handler count is more than 5k, and
>> does not when it is lower? Have you repeated this behaviour?
>
>> When you say it is bouncing around different states, I suspect the ROOT
>> assignment was the problem and something got messed up there.
>> If the cause is the handler count, then that needs a different analysis.
>>
>> I think that if you can repeat the experiment and get the same behaviour,
>> you can share the logs so that we can ascertain the exact problem.
>
> Yeah I have repeated the behavior. But it seems the issue is due to some
> weird pauses in the region server whenever I bump up the region handler
> count (logs are below).
> I doubt the issue is GC, since it should not take
> such a long time: this is happening on startup with a 48GB heap size.
> There are no active clients either.
>
> I can safely say this is due to bumping up the region handler count,
> because I started 3 regionservers with 5000 handlers and 3 with
> 15000 handlers. The ones with 15000 spun up all IPC handlers and then
> errored out as shown in the logs below. These are just the logs around the
> error. Before the error there were a few more timeouts.
>
> I checked the ZooKeeper servers (I have a 3-node cluster); they did not GC
> around the same time, nor were they under any kind of load.
>
> Thanks,
> Viral
>
> Region Server Logs
> 2013-04-29 08:00:55,512 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=98.34 MB, free=11.61 GB, max=11.71 GB, blocks=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0, evictions=0, evicted=0, evictedPerRun=NaN
> 2013-04-29 08:02:35,674 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 40592ms for sessionid 0x703e48a8cfd81be6, closing socket connection and attempting reconnect
> 2013-04-29 08:02:36,286 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.152.152.84:2181.
> Will not attempt to authenticate using SASL (Unable to locate a login configuration)
> 2013-04-29 08:02:36,287 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.152.152.84:2181, initiating session
> 2013-04-29 08:02:36,288 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x703e48a8cfd81be6 has expired, closing socket connection
> 2013-04-29 08:03:16,287 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ,60020,1367221255417:
> regionserver:60020-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6
> regionserver:60020-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6
> received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:389)
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:286)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> 2013-04-29 08:03:16,288 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ,60020,1367221255417: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ,60020,1367221255417 as dead server
> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ,60020,1367221255417 as dead server
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
>     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:880)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:748)
>     at java.lang.Thread.run(Thread.java:662)
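
For anyone who wants to check the repetition count in their own logs rather than by eye, here is a minimal sketch in plain Java. The class and method names (SessionIdCount, countOccurrences) are made up for illustration and are not part of HBase or ZooKeeper; the identifier string is copied from the abort message quoted above. It only counts the repeats and says nothing about why the session id gets appended more than once.

```java
// Hypothetical helper, not HBase code: counts how many times the ZooKeeper
// session id appears in the watcher identifier from the abort message.
class SessionIdCount {

    // Counts non-overlapping occurrences of needle in haystack.
    static int countOccurrences(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0; // avoid looping forever on an empty search string
        }
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    public static void main(String[] args) {
        String sessionId = "0x703e48a8cfd81be6";
        // Identifier as it appears in the quoted FATAL log line.
        String identifier = "regionserver:60020"
                + "-0x703e48a8cfd81be6-0x703e48a8cfd81be6"
                + "-0x703e48a8cfd81be6-0x703e48a8cfd81be6"
                + "-0x703e48a8cfd81be6-0x703e48a8cfd81be6"
                + "-0x703e48a8cfd81be6-0x703e48a8cfd81be6";
        System.out.println(countOccurrences(identifier, sessionId)); // prints 8
    }
}
```

Running the same count against the identifier from a region server that started cleanly would show whether the repetition is specific to the servers that aborted.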