hbase-user mailing list archives

From Sandy Pratt <prat...@adobe.com>
Subject RE: RegionServer dying every two or three days
Date Wed, 04 Jan 2012 21:32:51 GMT
It seems like that 36-second GC could do it if it was a stop-the-world pause.  Is your ZK
session timeout less than 36 seconds?
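
For reference, I believe that timeout is controlled by the zookeeper.session.timeout property
in hbase-site.xml (subject to the ZK server's own session-timeout limits).  A minimal sketch,
with 60 seconds as an illustrative value rather than a recommendation:

<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>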

How large is your heap?  What JVM params are you running with?

If the collector isn't churning through swapped-out memory, then is it the case that
concurrent mode failures are just too expensive on your nodes?  When the concurrent (CMS)
collector fails, it falls back to a serial old-gen collection, which can be pretty slow.
I've heard 10 sec per GB of heap as a rule of thumb.  You could try the parallel compacting
old-gen collector and see what you get (-XX:+UseParallelOldGC).  That should make your worst
case GC much less expensive at the cost of more frequent latency spikes.  I usually see
about 1-3 sec per GB of heap with that collector (which makes me think the 10 sec per GB
number for the serial collector is wrong, but whatever).  With some young-gen tuning, you
can wind up with infrequent full GCs which you know you can survive, as opposed to
infrequent but often deadly concurrent mode failures.
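
A minimal sketch of what that might look like in conf/hbase-env.sh (the heap size, young-gen
size, and GC log path below are placeholders, not tuned recommendations for your hardware):

export HBASE_OPTS="$HBASE_OPTS -Xmx4g -Xmn256m \
  -XX:+UseParallelOldGC \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hbase/gc-regionserver.log"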

Another factor to consider is the role full GCs play in cleaning up the finalization queue,
which I think can interact with off-heap direct buffer allocations in a bad way.  I don't
pretend to be an expert here, but here are some clues:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4469299
https://groups.google.com/group/asynchbase/browse_thread/thread/c45bc7ba788b2357?pli=1
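
To make that concrete, here's a tiny sketch of mine (not from those links) showing why a
collector that rarely touches the old generation can let native memory pile up: the memory
behind a direct buffer is only released once the dead buffer object itself is reclaimed.

import java.nio.ByteBuffer;

public class DirectBufferSketch {
    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            // Each allocation grabs 64 MB of native, off-heap memory.
            ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024 * 1024);
            buf = null;
            // That 64 MB isn't returned until GC reclaims the dead buffer
            // object.  If the old gen is rarely collected, dead buffers and
            // their native memory accumulate; capping with
            // -XX:MaxDirectMemorySize at least makes the failure show up as
            // an OutOfMemoryError instead of silent memory/swap growth.
        }
    }
}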

On some of my app servers, I've noticed a curious situation where stage (with low traffic)
would be gigs into swap for no good reason while prod (with high traffic) would be running
stably with a reasonably sized heap (same config and hardware on both envs).  Switching to
the parallel old-gen collector, which in turn makes full GCs much more frequent, corrected
this problem in an immediate and obvious way.  The time of the change was instantly apparent
in the Cacti graphs, for example.  I can't prove it, but my hunch is that NIO and off-heap
byte buffer allocations (sometimes made behind the scenes, see above) have made the CMS
collector very risky for some long-running Java applications.  For those of you who use it
for HBase, I'd be curious to hear how frequent your full GCs are.

I think that's far enough afield for now!  Hope this helps somewhat.

Sandy

> -----Original Message-----
> From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> Sent: Wednesday, January 04, 2012 12:17
> To: user@hbase.apache.org
> Subject: Re: RegionServer dying every two or three days
>
> Sandy,
>
> It happened again:
>
> -----
> 12/01/04 14:51:41 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 46571ms for sessionid 0x334a8c09b080033, closing socket connection and attempting reconnect
> 12/01/04 14:51:41 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 46513ms for sessionid 0x334a8c09b080034, closing socket connection and attempting reconnect
> -----
>
> Comparing with the GC log, I found these lines around the time the client
> session expired:
>
> -----
> 14248.678: [GC 14248.679: [ParNew: 105291K->457K(118016K), 0.0221210 secs] 954217K->849457K(1705776K), 0.0222870 secs] [Times: user=0.05 sys=0.00, real=0.03 secs]
> 14249.959: [GC 14249.960: [ParNew: 105417K->392K(118016K), 0.0087140 secs] 954417K->849428K(1705776K), 0.0089260 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
> 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs] 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01, real=36.96 secs]
> 14296.604: [GC 14296.604: [ParNew: 105369K->523K(118016K), 0.0119440 secs] 954434K->849708K(1705776K), 0.0121650 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
> -----
>
> The third line took 36.96 seconds of real time; could this be causing the
> problem?
>
> I also notice some major compactions happening, even after I disabled them
> with:
>
> <property>
>   <name>hbase.hregion.majorcompaction</name>
>   <value>0</value>
> </property>
>
> Reading the code a little, it seems that even when it's disabled, a
> compaction that targets all the files is considered a major compaction. Is
> that right?
>
> The machines don't have swap, so the swappiness parameter doesn't seem to
> apply here. Any other suggestions?
>
> Thanks.
>
> 2012/1/4 Leonardo Gamas <leogamas@jusbrasil.com.br>
>
> > I will investigate this, thanks for the response.
> >
> >
> > 2012/1/3 Sandy Pratt <prattrs@adobe.com>
> >
> >> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed
> >> out, have not heard from server in 61103ms for sessionid
> >> 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
> >>
> >> It looks like the process has been unresponsive for some time, so ZK
> >> has terminated the session.  Did you experience a long GC pause right
> >> before this?  If you don't have GC logging enabled for the RS, you
> >> can sometimes tell by noticing a gap in the timestamps of the log
> >> statements leading up to the crash.
> >>
> >> If it turns out to be GC, you might want to look at your kernel
> >> swappiness setting (set it to 0) and your JVM params.
> >>
> >> Sandy
> >>
> >>
> >> > -----Original Message-----
> >> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> >> > Sent: Thursday, December 29, 2011 07:44
> >> > To: user@hbase.apache.org
> >> > Subject: RegionServer dying every two or three days
> >> >
> >> > Hi,
> >> >
> >> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3
> >> Slaves),
> >> > running on Amazon EC2. The master is a High-Memory Extra Large
> >> > Instance
> >> > (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper. The
> >> > slaves are Extra Large Instances (m1.xlarge) running Datanode,
> >> TaskTracker,
> >> > RegionServer and Zookeeper.
> >> >
> >> > From time to time, every two or three days, one of the RegionServer
> >> > processes goes down, but the other processes (DataNode, TaskTracker,
> >> > ZooKeeper) continue normally.
> >> >
> >> > Reading the logs:
> >> >
> >> > The connection with Zookeeper timed out:
> >> >
> >> > ---------------------------
> >> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
> >> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61205ms for sessionid 0x346c561a55953e, closing socket connection and attempting reconnect
> >> > ---------------------------
> >> >
> >> > And the Handlers start to fail:
> >> >
> >> > ---------------------------
> >> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from xx.xx.xx.xx:xxxx: output error
> >> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020 caught: java.nio.channels.ClosedChannelException
> >> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> >> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> >> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> >> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> >> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> >> >
> >> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from xx.xx.xx.xx:xxxx: output error
> >> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020 caught: java.nio.channels.ClosedChannelException
> >> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> >> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> >> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> >> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> >> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> >> > ---------------------------
> >> >
> >> > And finally the server throws a YouAreDeadException :( :
> >> >
> >> > ---------------------------
> >> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> >> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket connection
> >> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> >> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x346c561a55953e has expired, closing socket connection
> >> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >> > org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> >> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> >> >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >> >         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >> >         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
> >> >         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
> >> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
> >> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
> >> >         at java.lang.Thread.run(Thread.java:662)
> >> > Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >> >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
> >> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
> >> >         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
> >> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> >> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> >> >
> >> >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> >> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> >> >         at $Proxy6.regionServerReport(Unknown Source)
> >> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> >> >         ... 2 more
> >> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics: requests=66, regions=206, stores=2078, storefiles=970, storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0, flushQueueSize=0, usedHeap=1672, maxHeap=4083, blockCacheSize=705907552, blockCacheFree=150412064, blockCacheCount=10648, blockCacheHitCount=79578618, blockCacheMissCount=3036335, blockCacheEvictedCount=1401352, blockCacheHitRatio=96, blockCacheHitCachingRatio=98
> >> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> >> > ---------------------------
> >> >
> >> > Then I restart the RegionServer and everything goes back to normal.
> >> > Reading the DataNode, ZooKeeper and TaskTracker logs, I don't see any
> >> > abnormality in the same time window.
> >> > I think it was caused by the loss of the connection to ZooKeeper. Is it
> >> > advisable to run ZooKeeper on the same machines?
> >> > If the RegionServer loses its connection to ZooKeeper, is there a way
> >> > (a configuration, perhaps) to rejoin the cluster instead of simply dying?
> >> >
> >> > Any idea what is causing this, or how to prevent it from happening?
> >> >
> >> > Any help is appreciated.
> >> >
> >> > Best Regards,
> >> >
> >> > --
> >> >
> >> > *Leonardo Gamas*
> >> > Software Engineer
> >> > +557134943514
> >> > +557581347440
> >> > leogamas@jusbrasil.com.br
> >> > www.jusbrasil.com.br
> >>
> >
> >
> >
> > --
> >
> > *Leonardo Gamas*
> > Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
> > 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> >
> >
>
>
> --
>
> *Leonardo Gamas*
> Software Engineer/Chaos Monkey Engineer
> T (71) 3494-3514
> C (75) 8134-7440
> leogamas@jusbrasil.com.br
> www.jusbrasil.com.br
