From: Sandy Pratt
To: "user@hbase.apache.org"
Date: Thu, 5 Jan 2012 10:29:52 -0800
Subject: RE: RegionServer dying every two or three days

That size heap doesn't seem like it should cause a 36 second GC (a minor GC even, if I remember your logs correctly), so I tend to think that other things are probably going on. This line here:

14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs]
954388K->849478K(1705776K), 0.0364200 secs]
[Times: user=0.05 sys=0.01, real=36.96 secs]

is really mysterious to me. It seems to indicate that the process was blocked for almost 37 seconds during a minor collection. Note the CPU times are very low but the wall time is very high. If it was actually doing GC work, I'd expect to see user time higher than real time, as it is in other parallel collections (see your log snippet). Were you really so CPU starved that it took 37 seconds to get in 50ms of work?

I can't make sense of that. I'm trying to think of something that would block you for that long while all your threads are stopped for GC, other than being in swap, but I can't come up with anything. You're certain you're not in swap?
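If you want to double check, here is a quick sketch from the shell (nothing HBase-specific; <pid> below is just a placeholder for the RegionServer's process id, and the VmSwap field needs a reasonably recent kernel):

swapon -s                        # lists swap devices; no output means no swap is configured
free -m                          # shows total swap and how much of it is in use
vmstat 5 5                       # nonzero si/so columns mean pages are moving to/from swap
grep VmSwap /proc/<pid>/status   # how much of this particular process is swapped out

If any of that shows activity, the usual advice applies: keep vm.swappiness at 0 (sysctl -w vm.swappiness=0) on a dedicated HBase box.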
Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts while you troubleshoot?

Why is your new size so small? This generally means that relatively more objects are being tenured than would be with a larger new size. This could make collections of the old gen worse (GC time is said to be proportional to the number of live objects in the generation, and CMS does indeed cause STW pauses). A typical new to tenured ratio might be 1:3. Were the new gen GCs taking too long? This is probably orthogonal to your immediate issue, though.

-----Original Message-----
From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
Sent: Thursday, January 05, 2012 5:33 AM
To: user@hbase.apache.org
Subject: Re: RegionServer dying every two or three days

St.Ack,

I haven't made any attempt at GC tuning yet. I will read the perf section as suggested.
I'm currently using Nagios + JMX to monitor the cluster, but it's used for alerting only; the perfdata is not being stored, so it's kind of useless right now. I was thinking of using TSDB to store it. Any known cases of such an integration?

---

Sandy,

Yes, my timeout is 30 seconds:

zookeeper.session.timeout
30000

For our application that's an acceptable time to wait in case a RegionServer goes offline.

My heap is 4GB and my JVM params are:

-Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m -XX:MaxNewSize=128m
-XX:+DoEscapeAnalysis -XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log

I will try the -XX:+UseParallelOldGC param and post my feedback here.

---

Ramkrishna,

Seems the GC is the root of all evil in this case.

----

Thank you all for the answers. I will try out the valuable advice given here and post my results.

Leo Gamas.

2012/1/5 Ramkrishna S Vasudevan

> Recently we faced a similar problem and it was due to GC config. Pls
> check your GC.
>
> Regards
> Ram
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Thursday, January 05, 2012 2:50 AM
> To: user@hbase.apache.org
> Subject: Re: RegionServer dying every two or three days
>
> On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas wrote:
> > The third line took 36.96 seconds to execute, can this be causing
> > this problem?
> >
>
> Probably. Have you made any attempt at GC tuning?
>
> > Reading the code a little it seems that, even if it's disabled, if
> > all files are targeted in a compaction, it's considered a major
> > compaction. Is it right?
> >
>
> That is right. They get 'upgraded' from minor to major.
>
> This should be fine though. What you are avoiding by setting major
> compactions to 0 is all regions being major compacted on a period, a
> heavyweight, effective rewrite of all your data (unless already major
> compacted). It looks like you have this disabled, which is good until
> you've wrestled your cluster into submission.
>
> > The machines don't have swap, so the swappiness parameter doesn't seem
> > to apply here. Any other suggestion?
> >
>
> See the perf section of the hbase manual. It has our current list.
>
> Are you monitoring your cluster w/ ganglia or tsdb?
>
> St.Ack
>
> > Thanks.
> >
> >
> > 2012/1/4 Leonardo Gamas
> >
> >> I will investigate this, thanks for the response.
> >>
> >>
> >> 2012/1/3 Sandy Pratt
> >>
> >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
> >>>
> >>> It looks like the process has been unresponsive for some time, so ZK has terminated the session. Did you experience a long GC pause right before this? If you don't have GC logging enabled for the RS, you can sometimes tell by noticing a gap in the timestamps of the log statements leading up to the crash.
> >>>
> >>> If it turns out to be GC, you might want to look at your kernel swappiness setting (set it to 0) and your JVM params.
> >>>
> >>> Sandy
> >>>
> >>>
> >>> > -----Original Message-----
> >>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> >>> > Sent: Thursday, December 29, 2011 07:44
> >>> > To: user@hbase.apache.org
> >>> > Subject: RegionServer dying every two or three days
> >>> >
> >>> > Hi,
> >>> >
> >>> > I have an HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3 Slaves), running on Amazon EC2. The master is a High-Memory Extra Large Instance (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper. The slaves are Extra Large Instances (m1.xlarge) running Datanode, TaskTracker, RegionServer and Zookeeper.
> >>> >
> >>> > From time to time, every two or three days, one of the RegionServer processes goes down, but the other processes (DataNode, TaskTracker, Zookeeper) continue normally.
> >>> >
> >>> > Reading the logs:
> >>> >
> >>> > The connection with Zookeeper timed out:
> >>> >
> >>> > ---------------------------
> >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
> >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61205ms for sessionid 0x346c561a55953e, closing socket connection and attempting reconnect
> >>> > ---------------------------
> >>> >
> >>> > And the Handlers start to fail:
> >>> >
> >>> > ---------------------------
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from xx.xx.xx.xx:xxxx: output error
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020 caught: java.nio.channels.ClosedChannelException
> >>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> >>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> >>> >
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from xx.xx.xx.xx:xxxx: output error
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020 caught: java.nio.channels.ClosedChannelException
> >>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> >>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> >>> > ---------------------------
> >>> >
> >>> > And finally the server throws a YouAreDeadException :( :
> >>> >
> >>> > ---------------------------
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket connection
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x346c561a55953e has expired, closing socket connection
> >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >>> > org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> >>> >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >>> >         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >>> >         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
> >>> >         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
> >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
> >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
> >>> >         at java.lang.Thread.run(Thread.java:662)
> >>> > Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >>> >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
> >>> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
> >>> >         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
> >>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> >>> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> >>> >
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> >>> >         at $Proxy6.regionServerReport(Unknown Source)
> >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> >>> >         ... 2 more
> >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics: requests=66, regions=206, stores=2078, storefiles=970, storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0, flushQueueSize=0, usedHeap=1672, maxHeap=4083, blockCacheSize=705907552, blockCacheFree=150412064, blockCacheCount=10648, blockCacheHitCount=79578618, blockCacheMissCount=3036335, blockCacheEvictedCount=1401352, blockCacheHitRatio=96, blockCacheHitCachingRatio=98
> >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> >>> > ---------------------------
> >>> >
> >>> > Then I restart the RegionServer and everything is back to normal.
> >>> > Reading the DataNode, Zookeeper and TaskTracker logs, I don't see any abnormality in the same time window.
> >>> > I think it was caused by the loss of connection to Zookeeper. Is it advisable to run Zookeeper on the same machines?
> >>> > If the RegionServer loses its connection to Zookeeper, is there a way (a configuration, perhaps) for it to re-join the cluster instead of just dying?
> >>> >
> >>> > Any idea what is causing this? Or how to prevent it from happening?
> >>> >
> >>> > Any help is appreciated.
> >>> >
> >>> > Best Regards,
> >>> >
> >>> > --
> >>> >
> >>> > *Leonardo Gamas*
> >>> > Software Engineer
> >>> > +557134943514
> >>> > +557581347440
> >>> > leogamas@jusbrasil.com.br
> >>> > www.jusbrasil.com.br
> >>>
> >>
> >>
> >> --
> >>
> >> *Leonardo Gamas*
> >> Software Engineer/Chaos Monkey Engineer
> >> T (71) 3494-3514
> >> C (75) 8134-7440
> >> leogamas@jusbrasil.com.br
> >> www.jusbrasil.com.br
> >>
> >>
> >
> >
> > --
> >
> > *Leonardo Gamas*
> > Software Engineer/Chaos Monkey Engineer
> > T (71) 3494-3514
> > C (75) 8134-7440
> > leogamas@jusbrasil.com.br
> > www.jusbrasil.com.br
>
>

--

*Leonardo Gamas*
Software Engineer
+557134943514
+557581347440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br