From: Sandy Pratt
To: "user@hbase.apache.org"
Date: Thu, 5 Jan 2012 10:29:52 -0800
Subject: RE: RegionServer dying every two or three days

That size heap doesn't seem like it should cause a 36 second GC (a minor GC even, if I remember your logs correctly), so I tend to think that other things are probably going on. This line here:

14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs]
954388K->849478K(1705776K), 0.0364200 secs]
[Times: user=0.05 sys=0.01, real=36.96 secs]

is really mysterious to me. It seems to indicate that the process was blocked for almost 37 seconds during a minor collection. Note the CPU times are very low but the wall time is very high. If it was actually doing GC work, I'd expect to see user time higher than real time, as it is in other parallel collections (see your log snippet). Were you really so CPU starved that it took 37 seconds to get in 50ms of work?

I can't make sense of that. I'm trying to think of something that would block you for that long while all your threads are stopped for GC, other than being in swap, but I can't come up with anything. You're certain you're not in swap?
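If you want to double check, here is a quick sketch from the shell (nothing HBase-specific; <pid> below is just a placeholder for the RegionServer's process id, and the VmSwap field needs a reasonably recent kernel):

swapon -s                        # lists swap devices; no output means no swap is configured
free -m                          # shows total swap and how much of it is in use
vmstat 5 5                       # nonzero si/so columns mean pages are moving to/from swap
grep VmSwap /proc/<pid>/status   # how much of this particular process is swapped out

If any of that shows activity, the usual advice applies: keep vm.swappiness at 0 (sysctl -w vm.swappiness=0) on a dedicated HBase box.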
Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts while you troubleshoot?

Why is your new size so small? This generally means that relatively more objects are being tenured than would be with a larger new size. This could make collections of the old gen worse (GC time is said to be proportional to the number of live objects in the generation, and CMS does indeed cause STW pauses). A typical new to tenured ratio might be 1:3. Were the new gen GCs taking too long? This is probably orthogonal to your immediate issue, though.

-----Original Message-----
From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
Sent: Thursday, January 05, 2012 5:33 AM
To: user@hbase.apache.org
Subject: Re: RegionServer dying every two or three days

St.Ack,

I haven't made any attempt at GC tuning yet. I will read the perf section as suggested.
I'm currently using Nagios + JMX to monitor the cluster, but it's used for alerting only; the perfdata is not being stored, so it's kind of useless right now. I was thinking of using TSDB to store it. Any known cases of such an integration?

---

Sandy,

Yes, my timeout is 30 seconds:

zookeeper.session.timeout
30000

For our application that's an acceptable time to wait in case a RegionServer goes offline.

My heap is 4GB and my JVM params are:

-Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m -XX:MaxNewSize=128m
-XX:+DoEscapeAnalysis -XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log

I will try the -XX:+UseParallelOldGC param and post my feedback here.

---

Ramkrishna,

Seems the GC is the root of all evil in this case.

----

Thank you all for the answers. I will try out the valuable advice given here and post my results.

Leo Gamas.

2012/1/5 Ramkrishna S Vasudevan

> Recently we faced a similar problem and it was due to GC config. Pls
> check your GC.
>
> Regards
> Ram
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Thursday, January 05, 2012 2:50 AM
> To: user@hbase.apache.org
> Subject: Re: RegionServer dying every two or three days
>
> On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas wrote:
> > The third line took 36.96 seconds to execute, can this be causing
> > this problem?
> >
>
> Probably. Have you made any attempt at GC tuning?
>
> > Reading the code a little it seems that, even if it's disabled, if
> > all files are targeted in a compaction, it's considered a major
> > compaction. Is it right?
> >
>
> That is right. They get 'upgraded' from minor to major.
>
> This should be fine though. What you are avoiding by setting major
> compactions to 0 is all regions being major compacted on a period, a
> heavyweight, effective rewrite of all your data (unless already major
> compacted). It looks like you have this disabled, which is good until
> you've wrestled your cluster into submission.
>
> > The machines don't have swap, so the swappiness parameter doesn't seem
> > to apply here. Any other suggestion?
> >
>
> See the perf section of the hbase manual. It has our current list.
>
> Are you monitoring your cluster w/ ganglia or tsdb?
>
> St.Ack
>
> > Thanks.
> >
> >
> > 2012/1/4 Leonardo Gamas
> >
> >> I will investigate this, thanks for the response.
> >>
> >>
> >> 2012/1/3 Sandy Pratt
> >>
> >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
> >>>
> >>> It looks like the process has been unresponsive for some time, so ZK has terminated the session. Did you experience a long GC pause right before this? If you don't have GC logging enabled for the RS, you can sometimes tell by noticing a gap in the timestamps of the log statements leading up to the crash.
> >>>
> >>> If it turns out to be GC, you might want to look at your kernel swappiness setting (set it to 0) and your JVM params.
> >>>
> >>> Sandy
> >>>
> >>>
> >>> > -----Original Message-----
> >>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> >>> > Sent: Thursday, December 29, 2011 07:44
> >>> > To: user@hbase.apache.org
> >>> > Subject: RegionServer dying every two or three days
> >>> >
> >>> > Hi,
> >>> >
> >>> > I have an HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3 Slaves), running on Amazon EC2. The master is a High-Memory Extra Large Instance (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper. The slaves are Extra Large Instances (m1.xlarge) running Datanode, TaskTracker, RegionServer and Zookeeper.
> >>> >
> >>> > From time to time, every two or three days, one of the RegionServer processes goes down, but the other processes (DataNode, TaskTracker, Zookeeper) continue normally.
> >>> >
> >>> > Reading the logs:
> >>> >
> >>> > The connection with Zookeeper timed out:
> >>> >
> >>> > ---------------------------
> >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
> >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61205ms for sessionid 0x346c561a55953e, closing socket connection and attempting reconnect
> >>> > ---------------------------
> >>> >
> >>> > And the Handlers start to fail:
> >>> >
> >>> > ---------------------------
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from xx.xx.xx.xx:xxxx: output error
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020 caught: java.nio.channels.ClosedChannelException
> >>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> >>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> >>> >
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from xx.xx.xx.xx:xxxx: output error
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020 caught: java.nio.channels.ClosedChannelException
> >>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> >>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> >>> > ---------------------------
> >>> >
> >>> > And finally the server throws a YouAreDeadException :( :
> >>> >
> >>> > ---------------------------
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket connection
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x346c561a55953e has expired, closing socket connection
> >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >>> > org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> >>> >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >>> >         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >>> >         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
> >>> >         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
> >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
> >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
> >>> >         at java.lang.Thread.run(Thread.java:662)
> >>> > Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >>> >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
> >>> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
> >>> >         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
> >>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> >>> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> >>> >
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> >>> >         at $Proxy6.regionServerReport(Unknown Source)
> >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> >>> >         ... 2 more
> >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics: requests=66, regions=206, stores=2078, storefiles=970, storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0, flushQueueSize=0, usedHeap=1672, maxHeap=4083, blockCacheSize=705907552, blockCacheFree=150412064, blockCacheCount=10648, blockCacheHitCount=79578618, blockCacheMissCount=3036335, blockCacheEvictedCount=1401352, blockCacheHitRatio=96, blockCacheHitCachingRatio=98
> >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> >>> > ---------------------------
> >>> >
> >>> > Then I restart the RegionServer and everything is back to normal.
> >>> > Reading the DataNode, Zookeeper and TaskTracker logs, I don't see any abnormality in the same time window.
> >>> > I think it was caused by the loss of connection to Zookeeper. Is it advisable to run Zookeeper on the same machines?
> >>> > If the RegionServer loses its connection to Zookeeper, is there a way (a configuration, perhaps) for it to re-join the cluster instead of just dying?
> >>> >
> >>> > Any idea what is causing this? Or how to prevent it from happening?
> >>> >
> >>> > Any help is appreciated.
> >>> >
> >>> > Best Regards,
> >>> >
> >>> > --
> >>> >
> >>> > *Leonardo Gamas*
> >>> > Software Engineer
> >>> > +557134943514
> >>> > +557581347440
> >>> > leogamas@jusbrasil.com.br
> >>> > www.jusbrasil.com.br
> >>>
> >>
> >>
> >> --
> >>
> >> *Leonardo Gamas*
> >> Software Engineer/Chaos Monkey Engineer
> >> T (71) 3494-3514
> >> C (75) 8134-7440
> >> leogamas@jusbrasil.com.br
> >> www.jusbrasil.com.br
> >>
> >>
> >
> >
> > --
> >
> > *Leonardo Gamas*
> > Software Engineer/Chaos Monkey Engineer
> > T (71) 3494-3514
> > C (75) 8134-7440
> > leogamas@jusbrasil.com.br
> > www.jusbrasil.com.br
>
>

--

*Leonardo Gamas*
Software Engineer
+557134943514
+557581347440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br