hbase-user mailing list archives

From 谢良 <xieli...@xiaomi.com>
Subject Re: HBase logging paused for a long time then RS crashed
Date Thu, 03 Apr 2014 03:06:03 GMT
> 2014-04-03 08:44:14,391 WARN  [JvmPauseMonitor] util.JvmPauseMonitor:
> Detected pause in JVM or host machine (eg GC): pause of approximately
> 37907ms

What did the GC log say around that timestamp? And have you enabled -XX:+PrintGCApplicationStoppedTime?
With it you can at least tell whether a long pause is safepoint-related or not (most such
pauses are triggered by GC, though).
And was the system load high at that time?
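
For reference, a minimal sketch of how this could be set in conf/hbase-env.sh
(the flags are standard HotSpot options; the GC log path is only an example,
adjust it for your deployment):

  export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -XX:+PrintGCApplicationStoppedTime \
    -Xloggc:/var/log/hbase/gc-regionserver.log"

After a restart, lines like "Total time for which application threads were
stopped: ... seconds" will show every safepoint pause, not just the GC ones.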

Thanks,

________________________________________
From: Ted Yu [yuzhihong@gmail.com]
Sent: April 3, 2014 10:52
To: user@hbase.apache.org
Subject: Re: HBase logging paused for a long time then RS crashed

bq. java.io.EOFException: Premature EOF: no length prefix available

Have you checked the namenode log for this period (around 08:36:52)?

bq. org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
currently processing b05.jsepc.com,60020,1396421585047 as dead server

The long pause led to the above exception.

Are you using HDP 1.3?


On Wed, Apr 2, 2014 at 7:45 PM, Tao Xiao <xiaotao.cs.nju@gmail.com> wrote:

> I'm using Hortonworks HDP and my HBase cluster has 14 nodes; one of them
> crashed today (there were no jobs running). I looked at the crashed region
> server's log and found that its logging paused for about 6 minutes (the log
> level is INFO), and then that RS crashed.
>
> The RS's log is as follows:
>
> 2014-04-03 08:36:52,132 INFO
>  [PostOpenDeployTasks:4264e3258e009f99263dad3c8f2888ed]
> regionserver.HRegionServer: Post open deploy tasks for
>
> region=QSH_CONS_DATA_201403,e6666658,1396485424003.4264e3258e009f99263dad3c8f2888ed.
> 2014-04-03 08:36:52,136 INFO
>  [PostOpenDeployTasks:4264e3258e009f99263dad3c8f2888ed] catalog.MetaEditor:
> Updated row
>
> QSH_CONS_DATA_201403,e6666658,1396485424003.4264e3258e009f99263dad3c8f2888ed.
> with server=b05.jsepc.com,60020,1396421585047
> *2014-04-03 08:36:52,136* INFO
>  [PostOpenDeployTasks:4264e3258e009f99263dad3c8f2888ed]
> regionserver.HRegionServer: Finished post open deploy task for
>
> QSH_CONS_DATA_201403,e6666658,1396485424003.4264e3258e009f99263dad3c8f2888ed.
> *2014-04-03 08:43:13,535* INFO  [RpcServer.handler=96,port=60020]
> hdfs.DFSClient: Will fetch a new access token and retry, access token was
> invalid when connecting to /10.134.101.119:50010 :
> org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException: Got
> access token error for OP_READ_BLOCK, self=/10.134.101.115:34507, remote=/
> 10.134.101.119:50010, for file
>
> /apps/hbase/data/data/default/E_MP_DAY_READ_20140322/b396e822aac1d59c2e52a9bb072c9d2f/info/030a5559a655442e9e86700fe53facfe,
> for pool BP-898918553-10.134.101.112-1393904898674 block 1075306210_1569446
> 2014-04-03 08:43:27,624 INFO  [RpcServer.handler=61,port=60020]
> regionserver.HRegion: writing data to region
>
> QSH_CONS_DATA_201403,0ccccccc,1396485424002.5c8a662eacc9e712378ea48819a74267.
> with WAL disabled. Data may be lost in the event of a crash.
> 2014-04-03 08:43:35,486 INFO  [RpcServer.handler=40,port=60020]
> regionserver.HRegion: writing data to region
>
> QSH_CONS_DATA_201403,0ccccccc,1396485424002.5c8a662eacc9e712378ea48819a74267.
> with WAL disabled. Data may be lost in the event of a crash.
> 2014-04-03 08:43:35,640 WARN  [Thread-16] wal.FSHLog: Couldn't find oldest
> seqNum for the region we are about to flush:
> [5c8a662eacc9e712378ea48819a74267]
> 2014-04-03 08:44:14,390 INFO  [regionserver60020-SendThread(
> a03.jsepc.com:2181)] zookeeper.ClientCnxn: Client session timed out, have
> not heard from server in 41884ms for sessionid 0x245211f9a740011, closing
> socket connection and attempting reconnect
> 2014-04-03 08:44:14,390 WARN  [regionserver60020] util.Sleeper: We slept
> 38975ms instead of 3000ms, this is likely due to a long garbage collecting
> pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2014-04-03 08:44:14,390 WARN  [regionserver60020.compactionChecker]
> util.Sleeper: We slept 46085ms instead of 10000ms, this is likely due to a
> long garbage collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2014-04-03 08:44:14,391 WARN  [ResponseProcessor for block
> BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647]
> hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block
> BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647
> java.io.EOFException: Premature EOF: no length prefix available
> at
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1492)
> at
>
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:116)
> at
>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:721)
> 2014-04-03 08:44:14,390 INFO  [regionserver60020-SendThread(
> b05.jsepc.com:2181)] zookeeper.ClientCnxn: Client session timed out, have
> not heard from server in 41884ms for sessionid 0x545211f4de9001e, closing
> socket connection and attempting reconnect
> 2014-04-03 08:44:14,390 INFO  [regionserver60020-SendThread(
> a02.jsepc.com:2181)] zookeeper.ClientCnxn: Client session timed out, have
> not heard from server in 48443ms for sessionid 0x145211fd930001f, closing
> socket connection and attempting reconnect
> 2014-04-03 08:44:14,391 WARN  [JvmPauseMonitor] util.JvmPauseMonitor:
> Detected pause in JVM or host machine (eg GC): pause of approximately
> 37907ms
> GC pool 'ParNew' had collection(s): count=1 time=31371ms
> GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=1577ms
> 2014-04-03 08:44:14,391 WARN  [regionserver60020.periodicFlusher]
> util.Sleeper: We slept 46094ms instead of 10000ms, this is likely due to a
> long garbage collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2014-04-03 08:44:14,416 WARN  [DataStreamer for file /apps/hbase/data/WALs/
> b05.jsepc.com,60020,1396421585047/b05.jsepc.com
> %2C60020%2C1396421585047.1396484294737
> block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647]
> hdfs.DFSClient: Error Recovery for block
> BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647 in
> pipeline 10.134.101.115:50010, 10.134.101.108:50010, 10.134.101.104:50010:
> bad datanode 10.134.101.115:50010
> 2014-04-03 08:44:14,460 WARN  [DataStreamer for file /apps/hbase/data/WALs/
> b05.jsepc.com,60020,1396421585047/b05.jsepc.com
> %2C60020%2C1396421585047.1396484294737
> block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647]
> hdfs.DFSClient: DataStreamer Exception
>
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on /apps/hbase/data/WALs/b05.jsepc.com,60020,1396421585047/
> b05.jsepc.com%2C60020%2C1396421585047.1396484294737: File does not exist.
> Holder DFSClient_hb_rs_b05.jsepc.com,60020,1396421585047_-1660168802_29
> does not have any open files.
> at
>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2764)
> at
>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2754)
> at
>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:2682)
> at
>
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:585)
> at
>
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:409)
> at
>
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59584)
> at
>
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1347)
> at org.apache.hadoop.ipc.Client.call(Client.java:1300)
> at
>
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> at $Proxy13.getAdditionalDatanode(Unknown Source)
> at
>
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:352)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
>
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> at
>
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at $Proxy14.getAdditionalDatanode(Unknown Source)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:266)
> at $Proxy15.getAdditionalDatanode(Unknown Source)
> at
>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:919)
> at
>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
> at
>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
> at
>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)
> 2014-04-03 08:44:14,494 FATAL [regionserver60020]
> regionserver.HRegionServer: ABORTING region server
> b05.jsepc.com,60020,1396421585047:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b05.jsepc.com,60020,1396421585047 as dead server
> at
>
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
> at
>
> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
> at
>
> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
> at
>
> org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
> at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>
> org.apache.hadoop.hbase.YouAreDeadException:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b05.jsepc.com,60020,1396421585047 as dead server
> at
>
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
> at
>
> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
> at
>
> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
> at
>
> org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
> at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
>
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at
>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at
>
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
> at
>
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
> at
>
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:277)
> at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:985)
> at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:832)
> at java.lang.Thread.run(Thread.java:662)
> Caused by:
>
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException):
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b05.jsepc.com,60020,1396421585047 as dead server
> at
>
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
> at
>
> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
> at
>
> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
> at
>
> org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
> at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>
> at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1449)
> at
>
> org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1653)
> at
>
> org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1711)
> at
>
> org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:5414)
> at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:983)
> ... 2 more
> 2014-04-03 08:44:14,495 FATAL [regionserver60020]
> regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
> 2014-04-03 08:44:14,525 INFO  [regionserver60020]
> regionserver.HRegionServer: STOPPED:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b05.jsepc.com,60020,1396421585047 as dead server
> at
>
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
> at
>
> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
> at
>
> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
> at
>
> org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
> at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>
>
> After logging resumed at 08:43:13, the RS reported several exceptions and
> then crashed.
> The RS stopped logging from 08:36:52 to 08:43:13; is this what caused the
> RS to crash? I suspected a full GC during these 6 minutes, so I looked at
> the GC log on this RS but did not find any full GC, nor did I find any GC
> taking longer than a few dozen seconds.
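>
> (A check along the following lines can be used to scan the GC log for long
> stop-the-world pauses in that window; the file name and output format below
> are only assumptions, based on typical -XX:+PrintGCDetails /
> -XX:+PrintGCDateStamps output.)
>
>   # hypothetical GC log path; print GC events between 08:30 and 08:50 whose
>   # wall-clock ("real") time exceeded one second
>   grep '2014-04-03T08:[34]' /var/log/hbase/gc-regionserver.log \
>     | awk -F'real=' 'NF > 1 { split($2, t, " "); if (t[1] + 0 > 1.0) print }'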
>
> *I looked at HMaster's log and found its logging also paused for a long
> time (7 minutes)*. HMaster's log is as follows:
>
> 2014-04-03 08:37:04,751 INFO  [AM.ZK.Worker-pool2-t750]
> master.RegionStates: Transitioned {6d768061a22ccd020f0b43879dd9dd3b
> state=OPENING, ts=1396485424686, server=b07.jsepc.com,60020,1396437592290}
> to {6d768061a22ccd020f0b43879dd9dd3b state=OPEN, ts=1396485424751, server=
> b07.jsepc.com,60020,1396437592290}
> *2014-04-03 08:37:04*,753 INFO  [AM.ZK.Worker-pool2-t750]
> master.RegionStates: Onlined 6d768061a22ccd020f0b43879dd9dd3b on
> b07.jsepc.com,60020,1396437592290
> *2014-04-03 08:44:26*,599 INFO  [main-EventThread]
> zookeeper.RegionServerTracker: RegionServer ephemeral node deleted,
> processing expiration [b05.jsepc.com,60020,1396421585047]
> 2014-04-03 08:44:26,900 INFO  [MASTER_SERVER_OPERATIONS-b03:60000-0]
> handler.ServerShutdownHandler: Splitting logs for
> b05.jsepc.com,60020,1396421585047
> before assignment.
> 2014-04-03 08:44:26,933 INFO  [MASTER_SERVER_OPERATIONS-b03:60000-0]
> master.SplitLogManager: dead splitlog workers [b05.jsepc.com
> ,60020,1396421585047]
> 2014-04-03 08:44:26,935 INFO  [MASTER_SERVER_OPERATIONS-b03:60000-0]
> master.SplitLogManager: started splitting 4 logs in
> [hdfs://jsepc01HA/apps/hbase/data/WALs/b05.jsepc.com
> ,60020,1396421585047-splitting]
> 2014-04-03 08:44:26,941 INFO  [main-EventThread] master.SplitLogManager:
> task /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com
> %2C60020%2C1396421585047-splitting%2Fb05.jsepc.com
> %252C60020%252C1396421585047.1396479046831
> acquired by a07.jsepc.com,60020,1396421598233
> 2014-04-03 08:44:26,941 INFO  [main-EventThread] master.SplitLogManager:
> task /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com
> %2C60020%2C1396421585047-splitting%2Fb05.jsepc.com
> %252C60020%252C1396421585047.1396484294737
> acquired by b08.jsepc.com,60020,1396421591104
> 2014-04-03 08:44:26,941 INFO  [main-EventThread] master.SplitLogManager:
> task /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com
> %2C60020%2C1396421585047-splitting%2Fb05.jsepc.com
> %252C60020%252C1396421585047.1396480850188
> acquired by a08.jsepc.com,60020,1396421590699
> 2014-04-03 08:44:26,941 INFO  [main-EventThread] master.SplitLogManager:
> task /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com
> %2C60020%2C1396421585047-splitting%2Fb05.jsepc.com
> %252C60020%252C1396421585047.1396482573001
> acquired by b09.jsepc.com,60020,1396421585070
> 2014-04-03 08:44:27,120 ERROR [RpcServer.handler=32,port=60000]
> master.HMaster: Region server b05.jsepc.com,60020,1396421585047 reported a
> fatal error:
> ABORTING region server b05.jsepc.com,60020,1396421585047:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b05.jsepc.com,60020,1396421585047 as dead server
> at
>
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
> at
>
> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
> at
>
> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
> at
>
> org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
> at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>
> Cause:
> org.apache.hadoop.hbase.YouAreDeadException:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b05.jsepc.com,60020,1396421585047 as dead server
> at
>
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
> at
>
> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
> at
>
> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
> at
>
> org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
> at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
>
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at
>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at
>
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
> at
>
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
> at
>
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:277)
> at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:985)
> at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:832)
> at java.lang.Thread.run(Thread.java:662)
> Caused by:
>
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException):
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b05.jsepc.com,60020,1396421585047 as dead server
> at
>
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
> at
>
> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
> at
>
> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
> at
>
> org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
> at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>
> at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1449)
> at
>
> org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1653)
> at
>
> org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1711)
> at
>
> org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:5414)
> at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:983)
> ... 2 more
>
> 2014-04-03 08:44:27,702 INFO
> [b03.jsepc.com,60000,1396421381897.splitLogManagerTimeoutMonitor]
> master.SplitLogManager: total tasks = 4 unassigned = 0
> tasks={/hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com
> %2C60020%2C1396421585047-splitting%2Fb05.jsepc.com
> %252C60020%252C1396421585047.1396482573001=last_update
> = 1396485866944 last_version = 2 cur_worker_name =
> b09.jsepc.com,60020,1396421585070
> status = in_progress incarnation = 0 resubmits = 0 batch = installed = 4
> done = 0 error = 0, /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com
> %2C60020%2C1396421585047-splitting%2Fb05.jsepc.com
> %252C60020%252C1396421585047.1396484294737=last_update
> = 1396485866943 last_version = 2 cur_worker_name =
> b08.jsepc.com,60020,1396421591104
> status = in_progress incarnation = 0 resubmits = 0 batch = installed = 4
> done = 0 error = 0, /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com
> %2C60020%2C1396421585047-splitting%2Fb05.jsepc.com
> %252C60020%252C1396421585047.1396479046831=last_update
> = 1396485866943 last_version = 2 cur_worker_name =
> a07.jsepc.com,60020,1396421598233
> status = in_progress incarnation = 0 resubmits = 0 batch = installed = 4
> done = 0 error = 0, /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com
> %2C60020%2C1396421585047-splitting%2Fb05.jsepc.com
> %252C60020%252C1396421585047.1396480850188=last_update
> = 1396485866943 last_version = 2 cur_worker_name =
> a08.jsepc.com,60020,1396421590699
> status = in_progress incarnation = 0 resubmits = 0 batch = installed = 4
> done = 0 error = 0}
> 2014-04-03 08:44:27,715 ERROR [RpcServer.handler=51,port=60000]
> master.HMaster: Region server b05.jsepc.com,60020,1396421585047 reported a
> fatal error:
> ABORTING region server b05.jsepc.com,60020,1396421585047:
> regionserver:60020-0x545211f4de9001e, quorum=b03.jsepc.com:2181,
> a04.jsepc.com:2181,a03.jsepc.com:2181,a02.jsepc.com:2181,
> b05.jsepc.com:2181,
> baseZNode=/hbase-unsecure regionserver:60020-0x545211f4de9001e received
> expired from ZooKeeper, aborting
> Cause:
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
> at
>
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:401)
> at
>
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:319)
> at
>
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>
>
> I looked at HMaster's GC log and did not find any GC taking longer than 10
> seconds.
>
> *I also looked at the logs of other region servers which did not crash and
> found that they, too, paused logging for a long time at almost the same time.*
>
> So the phenomenon is: the RSs and HMaster stopped logging for several minutes
> (about 5~7 minutes), and some of the RSs crashed.
>
> Why would this happen?
> I changed the log level from the default DEBUG to INFO, but I don't think
> that would cause such a long gap in logging.
>
> I appreciate any help.
>