hbase-user mailing list archives

From Sandeep Reddy <sandeepvre...@outlook.com>
Subject Re: RegionServers repeatedly getting killed with GC Pause and Zookeeper timeout
Date Tue, 13 Dec 2016 08:13:44 GMT
We are facing the same problem again this week.

We haven't changed any HBase configuration for at least 4 to 5 months.

All of a sudden we started seeing this pattern where regionservers are getting killed due to
GC pauses and, later, ZooKeeper timeouts.

We are using 5 GB for the HBase heap and 6 GB for the bucket cache.
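
Roughly, that sizing corresponds to a setup like the following (an illustrative sketch, not a copy of our exact files; the bucket cache itself is sized in hbase-site.xml):

# hbase-env.sh (sketch; values illustrative)
export HBASE_REGIONSERVER_OPTS="-Xms5g -Xmx5g"               # 5 GB on-heap
export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize=6g"   # must be large enough to hold the off-heap bucket cache

# hbase-site.xml equivalents (written as comments here for brevity)
#   hbase.bucketcache.ioengine = offheap
#   hbase.bucketcache.size     = 6144     # MB; must fit inside MaxDirectMemorySize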


Following is the log from one of the regionservers:

2016-12-12 17:38:59,142 WARN  [regionserver60020.periodicFlusher] util.Sleeper: We slept 30938ms
instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually
bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-12-12 17:40:25,076 INFO  [SplitLogWorker-HOST30,60020,1481556258263-SendThread(HOSTM5:4181)]
zookeeper.ClientCnxn: Socket connection established to HOSTM5/192.168.190.179:4181, initiating
session
2016-12-12 17:38:59,142 WARN  [regionserver60020] util.Sleeper: We slept 19044ms instead of
3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-12-12 17:38:54,384 WARN  [regionserver60020.compactionChecker] util.Sleeper: We slept
23805ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's
usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-12-12 17:38:16,281 INFO  [regionserver60020-SendThread(HOSTM3:4181)] zookeeper.ClientCnxn:
Socket connection established to HOSTM3/192.168.167.7:4181, initiating session
2016-12-12 17:40:25,091 INFO  [regionserver60020-SendThread(HOSTM1:4181)] zookeeper.ClientCnxn:
Socket connection established to HOSTM1/192.168.178.226:4181, initiating session
2016-12-12 17:40:25,093 INFO  [regionserver60020-SendThread(HOSTM3:4181)] zookeeper.ClientCnxn:
Client session timed out, have not heard from server in 128812ms for sessionid 0x558f30318e204de,
closing socket connection and attempting reconnect
2016-12-12 17:40:25,093 INFO  [regionserver60020-SendThread(HOSTM2:4181)] zookeeper.ClientCnxn:
Unable to reconnect to ZooKeeper service, session 0x558f30318e204df has expired, closing socket
connection
2016-12-12 17:40:25,093 INFO  [SplitLogWorker-HOST30,60020,1481556258263-SendThread(HOSTM5:4181)]
zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x458f30318de051d
has expired, closing socket connection
2016-12-12 17:40:25,089 INFO  [regionserver60020.leaseChecker] regionserver.HRegionServer:
Scanner 42156 lease expired on region PostsAnalysis-2016-11-5,exam,1480313370104.4e37b0f96946a104474a8edbba4f87fd.
2016-12-12 17:40:25,193 INFO  [regionserver60020.leaseChecker] regionserver.HRegionServer:
Scanner 42155 lease expired on region PostsAnalysis-2016-11-4,exam,1480313365296.4c80cf384fcdc7bfb7c83f625f936cfe.
2016-12-12 17:40:25,194 FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region
server HOST30,60020,1481556258263: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
rejected; currently processing HOST30,60020,1481556258263 as dead server
        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275)
        at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
        at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException:
Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server
        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275)
        at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
        at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:284)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1087)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:912)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException):
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing
HOST30,60020,1481556258263 as dead server
        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275)
        at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
        at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1457)
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
        at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:8289)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1085)
        ... 2 more
2016-12-12 17:40:25,087 DEBUG [BucketCache Statistics #0] bucket.BucketCache: BucketCache
Stats: failedBlockAdditions=23240, total=4.00 GB, free=1.05 GB, usedSize=2.96 GB, cacheSize=2.54
GB, accesses=7557955, hits=6538991, IOhitsPerSecond=1432, IOTimePerHit=0.88, hitRatio=86.52%,
cachingAccesses=7525976, cachingHits=6530392, cachingHitsRatio=86.77%, evictions=779, evicted=3334211,
evictedPerRun=4280.11669921875
2016-12-12 17:40:25,304 FATAL [regionserver60020] regionserver.HRegionServer: RegionServer
abort: loaded coprocessors are: []
2016-12-12 17:40:25,116 WARN  [SplitLogWorker-HOST30,60020,1481556258263-EventThread] client.HConnectionManager$HConnectionImplementation:
This client just lost it's session with ZooKeeper, closing it. It will be recreated next time
someone needs it
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:403)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:321)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-12-12 17:40:25,105 WARN  [ResponseProcessor for block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521]
hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521
java.net.SocketTimeoutException: 75000 millis timeout while waiting for channel to be ready
for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.139.245:50366 remote=/192.168.139.245:45710]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1998)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:798)
2016-12-12 17:40:25,095 WARN  [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation:
This client just lost it's session with ZooKeeper, closing it. It will be recreated next time
someone needs it
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:403)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:321)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-12-12 17:40:25,307 INFO  [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation:
Closing zookeeper sessionid=0x458f30318de051c
2016-12-12 17:40:25,307 INFO  [regionserver60020-EventThread] zookeeper.ClientCnxn: EventThread
shut down
2016-12-12 17:40:25,095 INFO  [regionserver60020-SendThread(HOSTM1:4181)] zookeeper.ClientCnxn:
Unable to reconnect to ZooKeeper service, session 0x458f30318de051c has expired, closing socket
connection
2016-12-12 17:40:25,094 WARN  [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation:
This client just lost it's session with ZooKeeper, closing it. It will be recreated next time
someone needs it
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:403)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:321)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-12-12 17:40:25,306 WARN  [DataStreamer for file /Veooz-hbase/WALs/HOST30,60020,1481556258263/HOST30%2C60020%2C1481556258263.1481564167609
block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521] hdfs.DFSClient:
Error Recovery for block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521
in pipeline 192.168.139.245:45710, 192.168.212.113:45710, 192.168.195.93:45710: bad datanode
192.168.139.245:45710
2016-12-12 17:40:25,306 INFO  [SplitLogWorker-HOST30,60020,1481556258263-EventThread] client.HConnectionManager$HConnectionImplementation:
Closing zookeeper sessionid=0x458f30318de051d
2016-12-12 17:40:25,314 INFO  [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation:
Closing zookeeper sessionid=0x558f30318e204df
2016-12-12 17:40:25,314 INFO  [SplitLogWorker-HOST30,60020,1481556258263-EventThread] zookeeper.ClientCnxn:
EventThread shut down
2016-12-12 17:40:25,314 INFO  [regionserver60020-EventThread] zookeeper.ClientCnxn: EventThread
shut down
2016-12-12 17:40:25,518 INFO  [regionserver60020] regionserver.HRegionServer: STOPPED: org.apache.hadoop.hbase.YouAreDeadException:
Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server
        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275)
        at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
        at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

2016-12-12 17:40:25,518 INFO  [regionserver60020] ipc.RpcServer: Stopping server on 60020
2016-12-12 17:40:25,518 INFO  [RpcServer.listener,port=60020] ipc.RpcServer: RpcServer.listener,port=60020:
stopping
2016-12-12 17:40:25,519 DEBUG [B.DefaultRpcServer.handler=0,queue=0,port=60020] wal.FSHLog:
interrupted while waiting for notification from AsyncNotifier
2016-12-12 17:40:25,519 INFO  [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped



Thanks,
Sandeep.
________________________________
From: saint.ack@gmail.com <saint.ack@gmail.com> on behalf of Stack <stack@duboce.net>
Sent: Tuesday, December 13, 2016 12:38:53 PM
To: Hbase-User
Subject: Re: RegionServers repeatedly getting killed with GC Pause and Zookeeper timeout

On Mon, Dec 12, 2016 at 10:22 PM, Sandeep Reddy <sandeepvreddy@outlook.com>
wrote:

> Hi,
>
>
> For the past week, multiple regionservers have been getting killed due to GC
> pauses and zookeeper timeout exceptions.
>
>
And previous to this week, all ran fine?

If so, do you know what changed?



> We are trying to tune GC values with different HBASE_OPTS, but no luck.
>
>
> Following is our regionserver configuration:
>
>   *   Hardware: 24 GB RAM with 8 cores
>   *   hbase-env.sh configuration:
>
> export HBASE_STOP_TIMEOUT="180"
> export HBASE_OPTS="$HBASE_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -Xmn256m -XX:+UseCMSInitiatingOccupancyOnly"
> export HBASE_OPTS="$HBASE_OPTS -XX:MaxNewSize=128m -XX:NewSize=128m"
> export HBASE_REGIONSERVER_OPTS="-Xmx5g -Xms5g"
> export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize=5g"
> export HBASE_MANAGES_ZK=false
> export HBASE_HEAPSIZE=4096
>
>   *   zookeeper tickTime=2000
>
>
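Note too that with tickTime=2000 the ZooKeeper server caps sessions at 20 x tickTime = 40 seconds by default (its maxSessionTimeout), so any stop-the-world pause longer than about 40s will expire the session regardless of what timeout the HBase client requests. Roughly (values illustrative):

# zoo.cfg (ZooKeeper server side)
tickTime=2000
# maxSessionTimeout defaults to 20 * tickTime = 40000 ms unless set explicitly, e.g.:
# maxSessionTimeout=90000

# hbase-site.xml (client side); the requested timeout is still capped by the server
#   zookeeper.session.timeout = 90000
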
A fair amount has been written on tuning GC, and GC in HBase in particular, if you poke
around on the net.

Can you give the server more heap? Above it says 5G. Can you go up from
there, or are you up against some ceiling?

What does it say caused the Full GC? Is it always the same thing?
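
If GC logging is not already on, something like this in hbase-env.sh will record the cause of each pause (standard JDK 7/8 flags; the log path is just an example):

export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/hbase/regionserver-gc.log"

Then grep for "Full GC" and for long "Total time for which application threads were stopped" entries around the 17:38-17:40 window above.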

Do you get a heap dump, given you run with -XX:+HeapDumpOnOutOfMemoryError? If
so, open it in a profiler (JMC?). Usually the object retainer can be
plainly seen.
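
Note that if the regionserver aborts without ever hitting an OOME, no dump is written automatically; you can also point the dump at a directory with space, or take one by hand (paths and pid below are placeholders):

export HBASE_OPTS="$HBASE_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hbase/"

# on-demand dump from a live regionserver
jmap -dump:live,format=b,file=/tmp/rs-heap.hprof <regionserver-pid>
# or, on JDK 7u4+ / 8
jcmd <regionserver-pid> GC.heap_dump /tmp/rs-heap.hprof

Open the .hprof in JMC, VisualVM, or Eclipse MAT and look at the dominator tree / biggest retainers.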

St.Ack

> Thanks,
> Sandeep.
>
