hbase-user mailing list archives

From: Leonardo Gamas <leoga...@jusbrasil.com.br>
Subject: Re: RegionServer dying every two or three days
Date: Wed, 04 Jan 2012 20:17:13 GMT
Sandy,

It happened again:

-----
12/01/04 14:51:41 INFO zookeeper.ClientCnxn: Client session timed out, have
not heard from server in 46571ms for sessionid 0x334a8c09b080033, closing
socket connection and attempting reconnect
12/01/04 14:51:41 INFO zookeeper.ClientCnxn: Client session timed out, have
not heard from server in 46513ms for sessionid 0x334a8c09b080034, closing
socket connection and attempting reconnect
-----

Comparing with the GC log, I found these lines around the time the client
session expired:

-----
14248.678: [GC 14248.679: [ParNew: 105291K->457K(118016K), 0.0221210 secs]
954217K->849457K(1705776K), 0.0222870 secs] [Times: user=0.05 sys=0.00,
real=0.03 secs]
14249.959: [GC 14249.960: [ParNew: 105417K->392K(118016K), 0.0087140 secs]
954417K->849428K(1705776K), 0.0089260 secs] [Times: user=0.01 sys=0.01,
real=0.01 secs]
14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs]
954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01,
real=36.96 secs]
14296.604: [GC 14296.604: [ParNew: 105369K->523K(118016K), 0.0119440 secs]
954434K->849708K(1705776K), 0.0121650 secs] [Times: user=0.03 sys=0.00,
real=0.01 secs]
-----

The third collection took 36.96 seconds of real time to complete. Could this
be what is causing the problem?
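
One thing that stands out in that entry: user and sys time are tiny (0.05 s /
0.01 s) compared to the 36.96 s of real time, so the JVM seems to have been
stopped or descheduled for most of that window rather than busy collecting.
If this pause is the culprit, as a stopgap I'm considering giving the
ZooKeeper session more headroom so a pause of that length doesn't expire it.
A sketch of what I have in mind for hbase-site.xml (assuming the ZooKeeper
servers' maxSessionTimeout is raised to match, since the server caps the
negotiated timeout):

<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>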

I also noticed some major compactions happening, even after I disabled them
with:

<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>

Reading the code a little, it seems that even with the periodic major
compaction disabled, a compaction that selects all the store files is still
treated as a major compaction. Is that right?
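
To make sure I'm reading it right, here is a toy sketch of the rule I think
I'm seeing (illustrative only -- the class and method names are made up, not
the actual HBase code):

import java.util.Arrays;
import java.util.List;

public class CompactionPromotionSketch {

    // Rule as I understand it: a compaction whose selection happens to
    // include every store file is treated as a major compaction (so deletes
    // and expired cells can be dropped), even when periodic major
    // compactions are disabled with hbase.hregion.majorcompaction=0.
    static boolean isMajorCompaction(List<String> selected, List<String> allStoreFiles) {
        return selected.size() == allStoreFiles.size();
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("hfile-1", "hfile-2", "hfile-3");
        System.out.println(isMajorCompaction(all, all));               // true  -> promoted to major
        System.out.println(isMajorCompaction(all.subList(0, 2), all)); // false -> stays minor
    }
}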

The machines don't have swap, so the swappiness parameter doesn't seem to
apply here. Any other suggestions?

Thanks.

2012/1/4 Leonardo Gamas <leogamas@jusbrasil.com.br>

> I will investigate this, thanks for the response.
>
>
> 2012/1/3 Sandy Pratt <prattrs@adobe.com>
>
>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
>> have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
>> closing socket connection and attempting reconnect
>>
>> It looks like the process has been unresponsive for some time, so ZK has
>> terminated the session.  Did you experience a long GC pause right before
>> this?  If you don't have GC logging enabled for the RS, you can sometimes
>> tell by noticing a gap in the timestamps of the log statements leading up
>> to the crash.
>>
>> If it turns out to be GC, you might want to look at your kernel
>> swappiness setting (set it to 0) and your JVM params.
>>
>> Sandy
>>
>>
>> > -----Original Message-----
>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
>> > Sent: Thursday, December 29, 2011 07:44
>> > To: user@hbase.apache.org
>> > Subject: RegionServer dying every two or three days
>> >
>> > Hi,
>> >
>> > I have an HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3 Slaves),
>> > running on Amazon EC2. The master is a High-Memory Extra Large Instance
>> > (m2.xlarge) with NameNode, JobTracker, HMaster and ZooKeeper. The slaves
>> > are Extra Large Instances (m1.xlarge) running DataNode, TaskTracker,
>> > RegionServer and ZooKeeper.
>> >
>> > From time to time, every two or three days, one of the RegionServer
>> > processes goes down, but the other processes (DataNode, TaskTracker,
>> > ZooKeeper) continue normally.
>> >
>> > Reading the logs:
>> >
>> > The connection with ZooKeeper timed out:
>> >
>> > ---------------------------
>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61205ms for sessionid 0x346c561a55953e, closing socket connection and attempting reconnect
>> > ---------------------------
>> >
>> > And the Handlers start to fail:
>> >
>> > ---------------------------
>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from xx.xx.xx.xx:xxxx: output error
>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020 caught: java.nio.channels.ClosedChannelException
>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>> >
>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from xx.xx.xx.xx:xxxx: output error
>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020 caught: java.nio.channels.ClosedChannelException
>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>> > ---------------------------
>> >
>> > And finally the server throws a YouAreDeadException :( :
>> >
>> > ---------------------------
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket connection
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x346c561a55953e has expired, closing socket connection
>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>> > org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>> >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>> >         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>> >         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
>> >         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
>> >         at java.lang.Thread.run(Thread.java:662)
>> > Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>> >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
>> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
>> >         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >         at java.lang.reflect.Method.invoke(Method.java:597)
>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>> >
>> >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>> >         at $Proxy6.regionServerReport(Unknown Source)
>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
>> >         ... 2 more
>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
>> > requests=66, regions=206, stores=2078, storefiles=970,
>> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0,
>> > flushQueueSize=0, usedHeap=1672, maxHeap=4083,
>> > blockCacheSize=705907552, blockCacheFree=150412064,
>> > blockCacheCount=10648, blockCacheHitCount=79578618,
>> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352,
>> > blockCacheHitRatio=96,
>> > blockCacheHitCachingRatio=98
>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled
>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>> > rejected; currently processing
>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
>> > ---------------------------
>> >
>> > Then I restart the RegionServer and everything is back to normal.
>> > Reading the DataNode, ZooKeeper and TaskTracker logs, I don't see any
>> > abnormality in the same time window.
>> > I think it was caused by the loss of the connection to ZooKeeper. Is it
>> > advisable to run ZooKeeper on the same machines?
>> > If the RegionServer loses its connection to ZooKeeper, is there a way (a
>> > configuration, perhaps) for it to rejoin the cluster instead of just dying?
>> >
>> > Any idea what is causing this, or how to prevent it from happening?
>> >
>> > Any help is appreciated.
>> >
>> > Best Regards,
>> >
>> > --
>> >
>> > *Leonardo Gamas*
>> > Software Engineer
>> > +557134943514
>> > +557581347440
>> > leogamas@jusbrasil.com.br
>> > www.jusbrasil.com.br
>>
>
>
>
> --
>
> *Leonardo Gamas*
> Software Engineer/Chaos Monkey Engineer
> T (71) 3494-3514
> C (75) 8134-7440
> leogamas@jusbrasil.com.br
> www.jusbrasil.com.br
>
>


-- 

*Leonardo Gamas*
Software Engineer/Chaos Monkey Engineer
T (71) 3494-3514
C (75) 8134-7440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br
