hbase-user mailing list archives

From Sandy Pratt <prat...@adobe.com>
Subject RE: RegionServer dying every two or three days
Date Thu, 05 Jan 2012 18:29:52 GMT
That size heap doesn't seem like it should cause a 36-second GC (a minor GC, even, if I remember
your logs correctly), so I tend to think that other things are probably going on.

This line here:

14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs]
954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01,
real=36.96 secs]

is really mysterious to me.  It seems to indicate that the process was blocked for almost
37 seconds during a minor collection.  Note the CPU times are very low but the wall time is
very high.  If it was actually doing GC work, I'd expect to see user time higher than real
time, as it is in other parallel collections (see your log snippet).  Were you really so CPU
starved that it took 37 seconds to get in 50ms of work?  I can't make sense of that.  I'm
trying to think of something that would block you for that long while all your threads are
stopped for GC, other than being in swap, but I can't come up with anything.  You're certain
you're not in swap?
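
If you want to double-check on a Linux box, something quick like this (plain stock commands, nothing HBase-specific) will show whether any swap is configured or in use:

free -m     # the Swap: line should read 0 total / 0 used if swap is truly absent
vmstat 5    # non-zero si/so columns mean pages are actively moving to/from swap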

Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts while you troubleshoot?

Why is your new size so small?  This generally means that relatively more objects are being
tenured than would be with a larger new size.  This could make collections of the old gen
worse (GC time is said to be proportional to the number of live objects in the generation,
and CMS does indeed cause STW pauses).  A typical new to tenured ratio might be 1:3.  Were
the new gen GCs taking too long?  This is probably orthogonal to your immediate issue, though.
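
For example, a trimmed set of flags to test with might look like the following (sizes are purely illustrative, assuming you keep the 4 GB heap; a 1024m new gen gives roughly the 1:3 new-to-tenured split mentioned above, and the two experimental flags are dropped while you troubleshoot):

-Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=1024m -XX:MaxNewSize=1024m
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log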



-----Original Message-----
From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
Sent: Thursday, January 05, 2012 5:33 AM
To: user@hbase.apache.org
Subject: Re: RegionServer dying every two or three days

St.Ack,

I haven't made any attempt at GC tuning yet.
I will read the perf section as suggested.
I'm currently using Nagios + JMX to monitor the cluster, but it's currently used for alerting
only; the perfdata is not being stored, so it's kind of useless right now. I was thinking
of using TSDB to store it. Are there any known cases of that integration?
---

Sandy,

Yes, my timeout is 30 seconds:

<property>
   <name>zookeeper.session.timeout</name>
   <value>30000</value>
</property>

For our application, that's a tolerable amount of time to wait in case a RegionServer goes offline.

My heap is 4GB and my JVM params are:

-Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis
-XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log

I will try the -XX:+UseParallelOldGC param and post my feedback here.
---

Ramkrishna,

It seems GC is the root of all evil in this case.
----

Thank you all for the answers. I will try out the valuable advice given here and post my
results.

Leo Gamas.

2012/1/5 Ramkrishna S Vasudevan <ramkrishna.vasudevan@huawei.com>

> Recently we faced a similar problem and it was due to GC config.  Please
> check your GC configuration.
>
> Regards
> Ram
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> Stack
> Sent: Thursday, January 05, 2012 2:50 AM
> To: user@hbase.apache.org
> Subject: Re: RegionServer dying every two or three days
>
> On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
> <leogamas@jusbrasil.com.br> wrote:
> > The third line took 36.96 seconds to execute; could this be causing
> > this problem?
> >
>
> Probably.  Have you made any attempt at GC tuning?
>
>
> > Reading the code a little, it seems that, even if it's disabled, if
> > all files are targeted in a compaction, it's considered a major
> > compaction. Is that right?
> >
>
> That is right.  They get 'upgraded' from minor to major.
>
> This should be fine though.  What you are avoiding by setting major
> compactions to 0 is all regions being major compacted on a schedule, a
> heavyweight, effective rewrite of all your data (unless already major
> compacted).  It looks like you have this disabled, which is good until
> you've wrestled your cluster into submission.
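>
> For reference, the periodic major compaction is normally switched off via
> hbase.hregion.majorcompaction in hbase-site.xml (value in milliseconds; 0
> disables the time-based schedule), along these lines:
>
> <property>
>    <name>hbase.hregion.majorcompaction</name>
>    <value>0</value>
> </property>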
>
>
> > The machines don't have swap, so the swappiness parameter doesn't seem
> > to apply here. Any other suggestions?
> >
>
> See the perf section of the hbase manual.  It has our current list.
>
> Are you monitoring your cluster w/ ganglia or tsdb?
>
>
> St.Ack
>
> > Thanks.
> >
> > 2012/1/4 Leonardo Gamas <leogamas@jusbrasil.com.br>
> >
> >> I will investigate this, thanks for the response.
> >>
> >>
> >> 2012/1/3 Sandy Pratt <prattrs@adobe.com>
> >>
> >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
> >>>
> >>> It looks like the process has been unresponsive for some time, so ZK
> >>> has terminated the session.  Did you experience a long GC pause right
> >>> before this?  If you don't have GC logging enabled for the RS, you can
> >>> sometimes tell by noticing a gap in the timestamps of the log
> >>> statements leading up to the crash.
> >>>
> >>> If it turns out to be GC, you might want to look at your kernel
> >>> swappiness setting (set it to 0) and your JVM params.
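> >>>
> >>> (On a typical Linux install that is just the vm.swappiness sysctl, e.g.:
> >>>
> >>>   sysctl -w vm.swappiness=0
> >>>
> >>> plus a "vm.swappiness = 0" line in /etc/sysctl.conf so it survives a reboot.)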
> >>>
> >>> Sandy
> >>>
> >>>
> >>> > -----Original Message-----
> >>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> >>> > Sent: Thursday, December 29, 2011 07:44
> >>> > To: user@hbase.apache.org
> >>> > Subject: RegionServer dying every two or three days
> >>> >
> >>> > Hi,
> >>> >
> >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3
> >>> > Slaves), running on Amazon EC2. The master is a High-Memory Extra
> >>> > Large Instance (m2.xlarge) with NameNode, JobTracker, HMaster and
> >>> > Zookeeper. The slaves are Extra Large Instances (m1.xlarge) running
> >>> > Datanode, TaskTracker, RegionServer and Zookeeper.
> >>> >
> >>> > From time to time, every two or three days, one of the RegionServer
> >>> > processes goes down, but the other processes (DataNode, TaskTracker,
> >>> > Zookeeper) continue normally.
> >>> >
> >>> > Reading the logs:
> >>> >
> >>> > The connection with Zookeeper timed out:
> >>> >
> >>> > ---------------------------
> >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
> >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61205ms for sessionid 0x346c561a55953e, closing socket connection and attempting reconnect
> >>> > ---------------------------
> >>> >
> >>> > And the Handlers start to fail:
> >>> >
> >>> > ---------------------------
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from xx.xx.xx.xx:xxxx: output error
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020 caught: java.nio.channels.ClosedChannelException
> >>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> >>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> >>> >
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from xx.xx.xx.xx:xxxx: output error
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020 caught: java.nio.channels.ClosedChannelException
> >>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> >>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> >>> > ---------------------------
> >>> >
> >>> > And finally the server throws a YouAreDeadException :( :
> >>> >
> >>> > ---------------------------
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket connection
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x346c561a55953e has expired, closing socket connection
> >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >>> > org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> >>> >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >>> >         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >>> >         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
> >>> >         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
> >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
> >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
> >>> >         at java.lang.Thread.run(Thread.java:662)
> >>> > Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >>> >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
> >>> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
> >>> >         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
> >>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> >>> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> >>> >
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> >>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> >>> >         at $Proxy6.regionServerReport(Unknown Source)
> >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> >>> >         ... 2 more
> >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics: requests=66, regions=206, stores=2078, storefiles=970, storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0, flushQueueSize=0, usedHeap=1672, maxHeap=4083, blockCacheSize=705907552, blockCacheFree=150412064, blockCacheCount=10648, blockCacheHitCount=79578618, blockCacheMissCount=3036335, blockCacheEvictedCount=1401352, blockCacheHitRatio=96, blockCacheHitCachingRatio=98
> >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> >>> > ---------------------------
> >>> >
> >>> > Then I restart the RegionServer and everything goes back to normal.
> >>> > Reading the DataNode, Zookeeper and TaskTracker logs, I don't see any
> >>> > abnormality in the same time window.
> >>> > I think it was caused by the loss of connection to ZooKeeper. Is it
> >>> > advisable to run ZooKeeper on the same machines?
> >>> > If the RegionServer loses its connection to ZooKeeper, is there a way
> >>> > (a configuration, perhaps) to re-join the cluster instead of just dying?
> >>> >
> >>> > Any idea what is causing this? Or how to prevent it from happening?
> >>> >
> >>> > Any help is appreciated.
> >>> >
> >>> > Best Regards,
> >>> >
> >>> > --
> >>> >
> >>> > *Leonardo Gamas*
> >>> > Software Engineer
> >>> > +557134943514
> >>> > +557581347440
> >>> > leogamas@jusbrasil.com.br
> >>> > www.jusbrasil.com.br
> >>>
> >>
> >>
> >>
> >> --
> >>
> >> *Leonardo Gamas*
> >> Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
> >> 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> >>
> >>
> >
> >
> > --
> >
> > *Leonardo Gamas*
> > Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
> > 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
>
>


--

*Leonardo Gamas*
Software Engineer
+557134943514
+557581347440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br
