Subject: Re: HBase logging paused for a long time then RS crashed
From: Ted Yu
To: user@hbase.apache.org
Date: Wed, 2 Apr 2014 19:52:50 -0700

bq. java.io.EOFException: Premature EOF: no length prefix available

Have you checked the namenode log for this period (around 08:36:52)?

bq. org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server

The long pause led to the above exception: the region server was silent for longer than its ZooKeeper session timeout, so the master expired its session, marked it dead, and rejected its next report (see the two sketches below the quoted log).

Are you using HDP 1.3?

On Wed, Apr 2, 2014 at 7:45 PM, Tao Xiao wrote:

> I'm using Hortonworks HDP and my HBase cluster has 14 nodes; one of them crashed today (there was no job running). I looked at the crashed region server's log and found that its logging paused for about 6 minutes (the log level is INFO), and then that RS crashed.
>
> The RS's log is as follows:
>
> 2014-04-03 08:36:52,132 INFO [PostOpenDeployTasks:4264e3258e009f99263dad3c8f2888ed] regionserver.HRegionServer: Post open deploy tasks for region=QSH_CONS_DATA_201403,e6666658,1396485424003.4264e3258e009f99263dad3c8f2888ed.
> 2014-04-03 08:36:52,136 INFO [PostOpenDeployTasks:4264e3258e009f99263dad3c8f2888ed] catalog.MetaEditor: Updated row QSH_CONS_DATA_201403,e6666658,1396485424003.4264e3258e009f99263dad3c8f2888ed. with server=b05.jsepc.com,60020,1396421585047
> *2014-04-03 08:36:52,136* INFO [PostOpenDeployTasks:4264e3258e009f99263dad3c8f2888ed] regionserver.HRegionServer: Finished post open deploy task for QSH_CONS_DATA_201403,e6666658,1396485424003.4264e3258e009f99263dad3c8f2888ed.
> *2014-04-03 08:43:13,535* INFO [RpcServer.handler=96,port=60020] hdfs.DFSClient: Will fetch a new access token and retry, access token was invalid when connecting to /10.134.101.119:50010 : org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException: Got access token error for OP_READ_BLOCK, self=/10.134.101.115:34507, remote=/10.134.101.119:50010, for file /apps/hbase/data/data/default/E_MP_DAY_READ_20140322/b396e822aac1d59c2e52a9bb072c9d2f/info/030a5559a655442e9e86700fe53facfe, for pool BP-898918553-10.134.101.112-1393904898674 block 1075306210_1569446
> 2014-04-03 08:43:27,624 INFO [RpcServer.handler=61,port=60020] regionserver.HRegion: writing data to region QSH_CONS_DATA_201403,0ccccccc,1396485424002.5c8a662eacc9e712378ea48819a74267. with WAL disabled. Data may be lost in the event of a crash.
> 2014-04-03 08:43:35,486 INFO [RpcServer.handler=40,port=60020] regionserver.HRegion: writing data to region QSH_CONS_DATA_201403,0ccccccc,1396485424002.5c8a662eacc9e712378ea48819a74267. with WAL disabled. Data may be lost in the event of a crash.
> 2014-04-03 08:43:35,640 WARN [Thread-16] wal.FSHLog: Couldn't find oldest seqNum for the region we are about to flush: [5c8a662eacc9e712378ea48819a74267]
> 2014-04-03 08:44:14,390 INFO [regionserver60020-SendThread(a03.jsepc.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 41884ms for sessionid 0x245211f9a740011, closing socket connection and attempting reconnect
> 2014-04-03 08:44:14,390 WARN [regionserver60020] util.Sleeper: We slept 38975ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2014-04-03 08:44:14,390 WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 46085ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2014-04-03 08:44:14,391 WARN [ResponseProcessor for block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647
> java.io.EOFException: Premature EOF: no length prefix available
>     at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1492)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:116)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:721)
> 2014-04-03 08:44:14,390 INFO [regionserver60020-SendThread(b05.jsepc.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 41884ms for sessionid 0x545211f4de9001e, closing socket connection and attempting reconnect
> 2014-04-03 08:44:14,390 INFO [regionserver60020-SendThread(a02.jsepc.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 48443ms for sessionid 0x145211fd930001f, closing socket connection and attempting reconnect
> 2014-04-03 08:44:14,391 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 37907ms
> GC pool 'ParNew' had collection(s): count=1 time=31371ms
> GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=1577ms
> 2014-04-03 08:44:14,391 WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 46094ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2014-04-03 08:44:14,416 WARN [DataStreamer for file /apps/hbase/data/WALs/b05.jsepc.com,60020,1396421585047/b05.jsepc.com%2C60020%2C1396421585047.1396484294737 block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647] hdfs.DFSClient: Error Recovery for block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647 in pipeline 10.134.101.115:50010, 10.134.101.108:50010, 10.134.101.104:50010: bad datanode 10.134.101.115:50010
> 2014-04-03 08:44:14,460 WARN [DataStreamer for file /apps/hbase/data/WALs/b05.jsepc.com,60020,1396421585047/b05.jsepc.com%2C60020%2C1396421585047.1396484294737 block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647] hdfs.DFSClient: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /apps/hbase/data/WALs/b05.jsepc.com,60020,1396421585047/b05.jsepc.com%2C60020%2C1396421585047.1396484294737: File does not exist. Holder DFSClient_hb_rs_b05.jsepc.com,60020,1396421585047_-1660168802_29 does not have any open files.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2764)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2754)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:2682)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:585)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:409)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59584)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
>
>     at org.apache.hadoop.ipc.Client.call(Client.java:1347)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>     at $Proxy13.getAdditionalDatanode(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:352)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>     at $Proxy14.getAdditionalDatanode(Unknown Source)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:266)
>     at $Proxy15.getAdditionalDatanode(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:919)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)
> 2014-04-03 08:44:14,494 FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region server b05.jsepc.com,60020,1396421585047: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
>     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
>     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
>     at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
>     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
>     at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>
> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
>     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
>     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
>     at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
>     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
>     at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:277)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:985)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:832)
>     at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
>     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
>     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
>     at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
>     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
>     at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>
>     at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1449)
>     at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1653)
>     at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1711)
>     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:5414)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:983)
>     ... 2 more
> 2014-04-03 08:44:14,495 FATAL [regionserver60020] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
> 2014-04-03 08:44:14,525 INFO [regionserver60020] regionserver.HRegionServer: STOPPED: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
>     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
>     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
>     at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
>     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
>     at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>
>
> After logging resumed at 08:43:13, the RS reported the exceptions above and then crashed.
> The RS stopped logging from 08:36:52 to 08:43:13; is this what caused the RS to crash? I suspected a full GC during these 6 minutes, so I checked the GC log on this RS, but I did not find any full GC, nor any GC taking longer than a few dozen seconds.
>
> *I looked at the HMaster's log and found that its logging also paused for a long time (7 minutes).*
> HMaster's log is as follows:
>
> 2014-04-03 08:37:04,751 INFO [AM.ZK.Worker-pool2-t750] master.RegionStates: Transitioned {6d768061a22ccd020f0b43879dd9dd3b state=OPENING, ts=1396485424686, server=b07.jsepc.com,60020,1396437592290} to {6d768061a22ccd020f0b43879dd9dd3b state=OPEN, ts=1396485424751, server=b07.jsepc.com,60020,1396437592290}
> *2014-04-03 08:37:04*,753 INFO [AM.ZK.Worker-pool2-t750] master.RegionStates: Onlined 6d768061a22ccd020f0b43879dd9dd3b on b07.jsepc.com,60020,1396437592290
> *2014-04-03 08:44:26*,599 INFO [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [b05.jsepc.com,60020,1396421585047]
> 2014-04-03 08:44:26,900 INFO [MASTER_SERVER_OPERATIONS-b03:60000-0] handler.ServerShutdownHandler: Splitting logs for b05.jsepc.com,60020,1396421585047 before assignment.
> 2014-04-03 08:44:26,933 INFO [MASTER_SERVER_OPERATIONS-b03:60000-0] master.SplitLogManager: dead splitlog workers [b05.jsepc.com,60020,1396421585047]
> 2014-04-03 08:44:26,935 INFO [MASTER_SERVER_OPERATIONS-b03:60000-0] master.SplitLogManager: started splitting 4 logs in [hdfs://jsepc01HA/apps/hbase/data/WALs/b05.jsepc.com,60020,1396421585047-splitting]
> 2014-04-03 08:44:26,941 INFO [main-EventThread] master.SplitLogManager: task /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396479046831 acquired by a07.jsepc.com,60020,1396421598233
> 2014-04-03 08:44:26,941 INFO [main-EventThread] master.SplitLogManager: task /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396484294737 acquired by b08.jsepc.com,60020,1396421591104
> 2014-04-03 08:44:26,941 INFO [main-EventThread] master.SplitLogManager: task /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396480850188 acquired by a08.jsepc.com,60020,1396421590699
> 2014-04-03 08:44:26,941 INFO [main-EventThread] master.SplitLogManager: task /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396482573001 acquired by b09.jsepc.com,60020,1396421585070
> 2014-04-03 08:44:27,120 ERROR [RpcServer.handler=32,port=60000] master.HMaster: Region server b05.jsepc.com,60020,1396421585047 reported a fatal error:
> ABORTING region server b05.jsepc.com,60020,1396421585047: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
>     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
>     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
>     at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
>     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
>     at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>
> Cause:
> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
>     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
>     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
>     at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
>     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
>     at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:277)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:985)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:832)
>     at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
>     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
>     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
>     at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
>     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
>     at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>
>     at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1449)
>     at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1653)
>     at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1711)
>     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:5414)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:983)
>     ... 2 more
>
> 2014-04-03 08:44:27,702 INFO [b03.jsepc.com,60000,1396421381897.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 4 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396482573001=last_update = 1396485866944 last_version = 2 cur_worker_name = b09.jsepc.com,60020,1396421585070 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 4 done = 0 error = 0, /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396484294737=last_update = 1396485866943 last_version = 2 cur_worker_name = b08.jsepc.com,60020,1396421591104 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 4 done = 0 error = 0, /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396479046831=last_update = 1396485866943 last_version = 2 cur_worker_name = a07.jsepc.com,60020,1396421598233 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 4 done = 0 error = 0, /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396480850188=last_update = 1396485866943 last_version = 2 cur_worker_name = a08.jsepc.com,60020,1396421590699 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 4 done = 0 error = 0}
> 2014-04-03 08:44:27,715 ERROR [RpcServer.handler=51,port=60000] master.HMaster: Region server b05.jsepc.com,60020,1396421585047 reported a fatal error:
> ABORTING region server b05.jsepc.com,60020,1396421585047: regionserver:60020-0x545211f4de9001e, quorum=b03.jsepc.com:2181,a04.jsepc.com:2181,a03.jsepc.com:2181,a02.jsepc.com:2181,b05.jsepc.com:2181, baseZNode=/hbase-unsecure regionserver:60020-0x545211f4de9001e received expired from ZooKeeper, aborting
> Cause:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:401)
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:319)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>
>
> I looked at the HMaster's GC log and did not find any GC taking longer than 10 seconds.
>
> *Actually, I also looked at the logs of other region servers that did not crash and found similar long pauses at almost the same time.*
>
> So the phenomenon is: the RSs and the HMaster stopped logging for several minutes (about 5~7 minutes), and some of the RSs crashed.
>
> Why would this happen?
> I changed the log level from the default DEBUG to INFO, but I don't think that would cause such a long gap in logging.
>
> Any help would be appreciated.
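To illustrate what the Sleeper and JvmPauseMonitor warnings in the quoted log are measuring, here is a minimal, self-contained sketch of the idea (my own illustration, not the actual Hadoop/HBase source): a thread sleeps for a fixed interval and then checks how much wall-clock time really passed. Anything that stalls the process (a long GC, heavy swapping, or the host/VM itself freezing) inflates the gap, which is why a pause of roughly 38 seconds can be reported even when the GC log shows no correspondingly long collection. That also means it is worth checking for swapping or a stalled host, not just GC, on the nodes that went quiet.

    // Minimal sketch of wall-clock pause detection (illustrative only, not the
    // real org.apache.hadoop.util.JvmPauseMonitor implementation).
    public class PauseDetectorSketch {
        private static final long SLEEP_MS = 500;             // intended sleep per iteration
        private static final long WARN_THRESHOLD_MS = 10000;  // report pauses above this

        public static void main(String[] args) throws InterruptedException {
            long lastNanos = System.nanoTime();
            while (true) {
                Thread.sleep(SLEEP_MS);
                long nowNanos = System.nanoTime();
                long elapsedMs = (nowNanos - lastNanos) / 1000000L;
                long extraMs = elapsedMs - SLEEP_MS;
                if (extraMs > WARN_THRESHOLD_MS) {
                    // Corresponds to "Detected pause in JVM or host machine (eg GC)":
                    // a pause longer than the ZooKeeper session timeout is what gets
                    // the region server declared dead by the master.
                    System.err.println("Detected pause of approximately " + extraMs + "ms");
                }
                lastNanos = nowNanos;
            }
        }
    }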
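On the timeout arithmetic: the ClientCnxn lines show the sessions expiring after roughly 41 to 48 seconds of silence, so the effective ZooKeeper session timeout on this cluster was around 40 seconds, far shorter than the ~6 minute logging gap. Once the session expired, the ephemeral node was deleted, the master started processing the server as dead, and the next regionServerReport was rejected with YouAreDeadException, which matches the master log above. The sketch below only shows how to read the configured value; the key name is the standard HBase one, the 90000 ms fallback is an assumption rather than a value taken from this cluster, and the timeout actually granted is capped by the ZooKeeper server's maxSessionTimeout, so raising only the HBase side may not take effect.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    // Sketch only: prints the session timeout HBase will request from ZooKeeper.
    public class SessionTimeoutCheck {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // Assumed fallback of 90000 ms; check your hbase-site.xml for the real value.
            int requestedMs = conf.getInt("zookeeper.session.timeout", 90000);
            System.out.println("Requested zookeeper.session.timeout = " + requestedMs + " ms");
            // The granted timeout is clamped by the ZooKeeper server's min/maxSessionTimeout,
            // which is why the RS above lost its session after only about 40 seconds.
        }
    }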