From: Proust (Proust/Feng Guizhou) [Saipan]
To: user@hbase.apache.org
Date: Fri, 23 Sep 2016 12:31:45 +0800
Subject: Re: [RegionServer Dead] Identify HBase Table Cause RegionServer Dead (Version 1.0.0-cdh5.5.2)

Thanks Ted and Vlad

I uploaded the screenshots to flickr; they seem clearer there:

1. MutateCount (Put Count) at table level (null as gap) https://flic.kr/p/LxLa6d
2. MutateCount (Put Count) at table level (null as line connected) https://flic.kr/p/LxREvK
3. MutateCount for the table with max MemStoreSize https://flic.kr/p/MnUkCt
4. Max Region MemStoreSize at table level https://flic.kr/p/LxR4TP
5. RegionCount for each RegionServer https://flic.kr/p/LxL9Zw

> In the graph, there was a purple line with spikes - I assume this was not
> for the table with max MemStoreSize.
> The legend in the graph is hard to read.
> If the sizes of the Puts vary, that may explain what you observed w.r.t.
> the two graphs.

The purple line with spikes is a table whose data comes from a scheduled Spark batch job; it has a low MemStoreSize and the job did not run at the problem time, so it is clearly not the problem.

From https://flic.kr/p/MnUkCt (item 3 above), the metrics do not show a significant MutateCount change for the table with the max MemStoreSize. This is why I'm not confident saying that table is the root cause, although it does have high IO and is the biggest table.

> 1. Tweak CMS config: -XX:CMSInitiatingOccupancyFraction=50
> -XX:+UseCMSInitiatingOccupancyOnly.
> Plus increase heap size accordingly to accommodate decreasing of a working
> set size (now CMS starts when 50% of heap is occupied). If you see a lot
> of "promotion failure" messages in GC log - this may help you.

There are some "promotion failed" entries, but not that many:

529874.782: [GC[YG occupancy: 725422 K (1380160 K)]529874.782: [Rescan (parallel) , 0.0105120 secs]529874.793: [weak refs processing, 0.0031030 secs]529874.796: [scrub string table, 0.0008700 secs] [1 CMS-remark: 13341310K(15243712K)] 14066733K(16623872K), 0.0149260 secs] [Times: user=0.18 sys=0.00, real=0.01 secs]
529874.797: [CMS-concurrent-sweep-start]
529874.858: [GC529874.858: [ParNew: 1380160K->153344K(1380160K), 0.0406620 secs] 14721537K->13708366K(16623872K), 0.0408930 secs] [Times: user=0.57 sys=0.01, real=0.04 secs]
529875.005: [GC529875.005: [ParNew: 1380160K->153344K(1380160K), 0.0388060 secs] 14922932K->13919282K(16623872K), 0.0390210 secs] [Times: user=0.57 sys=0.02, real=0.04 secs]
529875.146: [GC529875.146: [ParNew: 1380160K->153344K(1380160K), 0.0388120 secs] 15099772K->14090963K(16623872K), 0.0390400 secs] [Times: user=0.56 sys=0.01, real=0.04 secs]
529875.285: [GC529875.285: [ParNew: 1380160K->153344K(1380160K), 0.0368240 secs] 15211572K->14209325K(16623872K), 0.0370880 secs] [Times: user=0.55 sys=0.01, real=0.04 secs]
529875.419: [GC529875.419: [ParNew: 1380160K->153344K(1380160K), 0.0392130 secs] 15403252K->14384973K(16623872K), 0.0395230 secs] [Times: user=0.56 sys=0.00, real=0.04 secs]
529875.557: [GC529875.557: [ParNew: 1380160K->153344K(1380160K), 0.0373710 secs] 15554872K->14541229K(16623872K), 0.0375980 secs] [Times: user=0.56 sys=0.01, real=0.04 secs]
529875.699: [GC529875.699: [ParNew (*promotion failed*): 1380160K->1380160K(1380160K), 0.0922840 secs] 15726469K->15831128K(16623872K), 0.0924960 secs] [Times: user=0.53 sys=0.00, real=0.10 secs]
GC locker: Trying a full collection because scavenge failed
529875.791: [Full GC529875.791: [CMS529876.249: [CMS-concurrent-sweep: 1.080/1.452 secs] [Times: user=15.83 sys=1.81, real=1.45 secs] (concurrent mode failure): 14450968K->8061155K(15243712K), 7.4250640 secs] 15831128K->8061155K(16623872K), [CMS Perm : 55346K->55346K(83968K)], 7.4252340 secs] [Times: user=7.43 sys=0.00, real=7.42 secs]
529883.326: [GC529883.326: [ParNew: 1226816K->153344K(1380160K), 0.0339490 secs] 9287971K->8288344K(16623872K), 0.0341880 secs] [Times: user=0.52 sys=0.01, real=0.03 secs]

> 2. Migrate to 1.8 Java and switch to G1GC. Search "tuning G1GC for HBase"
> on the Internet

Thanks for this suggestion, I believe this is the direction to move. For now I need to identify the problem and get a deeper understanding of HBase, then I can address it with confidence. My reading of these two GC suggestions in terms of JVM flags is sketched below.
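For my own reference, this is roughly what I understand the two suggestions to mean as concrete RegionServer JVM options. The heap number is only my rough math (today we run a 32 GB heap with a 70% trigger, so a 50% trigger covering the same working set needs about 32 * 70 / 50 ~= 45 GB), and the G1 pause target and ParallelRefProcEnabled are placeholders from my reading, not anything tested on our cluster:

    # Suggestion 1: keep CMS, but start the concurrent cycle earlier and only at the configured threshold
    -Xms45g -Xmx45g
    -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
    -XX:CMSInitiatingOccupancyFraction=50
    -XX:+UseCMSInitiatingOccupancyOnly

    # Suggestion 2: after moving to JDK 8, switch the collector entirely
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=100
    -XX:+ParallelRefProcEnabled

Please correct me if I have read the suggestions wrong.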
> 3. Analyze what has triggered long GC in RS. Usually it is some long
> running M/R jobs or data import. Consider other approaches which are not so
> intrusive (ex. bulk load of data)

According to http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/, bulk load is good for ETL. The biggest table (the one with the max MemStoreSize) gets its data from RabbitMQ and then Puts to HBase one row at a time. I'm not clear yet how to change that to bulk load - any suggestion, or am I misunderstanding bulk load? I can see there is a bulkPut available in the Spark-HBase module (available at 2.0 :(); is that the same thing as bulk load? It seems I need to learn more. A rough sketch of the batching alternative I have in mind follows.
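To make the current ingestion concrete: today the consumer issues one Put per RabbitMQ message. Below is a minimal sketch of the client-side batching I could do short of real bulk load (which, as I understand it, writes HFiles directly instead of going through the MemStore). The class, table and column names and the queue helper are made-up placeholders, not our real schema:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.BufferedMutator;
    import org.apache.hadoop.hbase.client.BufferedMutatorParams;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class QueueToHBaseBatcher {

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf)) {
                // Placeholder table/family names; ours would differ.
                BufferedMutatorParams params =
                    new BufferedMutatorParams(TableName.valueOf("my_big_table"))
                        .writeBufferSize(4L * 1024 * 1024); // flush roughly every 4 MB instead of per row
                try (BufferedMutator mutator = conn.getBufferedMutator(params)) {
                    // In the real consumer this loop would be driven by RabbitMQ deliveries.
                    for (String[] msg : fetchBatchFromQueue()) {
                        Put put = new Put(Bytes.toBytes(msg[0]));       // msg[0] = row key
                        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"),
                                      Bytes.toBytes(msg[1]));           // msg[1] = value
                        mutator.mutate(put); // buffered on the client, sent to the RS in batches
                    }
                    mutator.flush(); // push out the tail of the buffer before acking the messages
                }
            }
        }

        // Stand-in for the RabbitMQ consumer; returns (rowkey, value) pairs.
        private static List<String[]> fetchBatchFromQueue() {
            return new ArrayList<>();
        }
    }

This only smooths the RPC pattern; the writes still go through the MemStore and WAL, so it would not remove MemStore pressure the way HFile-based bulk load would.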
> For question #2, were there region splits corresponding to the increase ?
> Please check master log. You can pastebin master log if needed.

Below is the master log at the problem time; it says nothing except that the RegionServer is dead 😂

2016-09-21 11:01:56,746 WARN org.apache.hadoop.hbase.master.cleaner.TimeToLiveHFileCleaner: Found a hfile (hdfs://nameservice1/hbase/archive/data/prod/variable/650cefeb142f5a6558adb82f89151d8d/d/bd487f2dd9424fda98c35e8230982f9c) newer than current time (1474423316746 < 1474423316747), probably a clock skew
2016-09-21 11:46:19,786 ERROR org.apache.hadoop.hbase.master.MasterRpcServices: Region server fds-hadoop-prod06-mp,60020,1473551672107 reported a fatal error:
ABORTING region server fds-hadoop-prod06-mp,60020,1473551672107: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead server
        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
        at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:745)
Cause:
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead server
        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
        at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:745)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:328)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1092)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:900)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead server
        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
        at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:745)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1219)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:8289)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1090)
        ... 2 more
2016-09-21 11:46:19,801 ERROR org.apache.hadoop.hbase.master.MasterRpcServices: Region server fds-hadoop-prod06-mp,60020,1473551672107 reported a fatal error:
ABORTING region server fds-hadoop-prod06-mp,60020,1473551672107: regionserver:60020-0x255c4a5f76332f6, quorum=fds-hadoop-prod05-mp:2181,fds-hadoop-prod07-mp:2181,fds-hadoop-prod06-mp:2181, baseZNode=/hbase regionserver:60020-0x255c4a5f76332f6 received expired from ZooKeeper, aborting
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:525)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:436)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-09-21 11:46:19,801 ERROR org.apache.hadoop.hbase.master.MasterRpcServices: Region server fds-hadoop-prod06-mp,60020,1473551672107 reported a fatal error:
ABORTING region server fds-hadoop-prod06-mp,60020,1473551672107: IOE in log roller
Cause:
java.io.IOException: cannot get log writer
        at org.apache.hadoop.hbase.wal.DefaultWALProvider.createWriter(DefaultWALProvider.java:365)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.createWriterInstance(FSHLog.java:757)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:722)
        at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:148)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: /hbase/WALs/fds-hadoop-prod06-mp,60020,1473551672107
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:2466)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2760)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2674)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2559)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:592)
        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:110)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:395)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
        at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1872)
        at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1737)
        at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1697)
        at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:449)
        at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:445)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.createNonRecursive(DistributedFileSystem.java:445)
        at org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:1121)
        at org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:1097)
        at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.init(ProtobufLogWriter.java:90)
        at org.apache.hadoop.hbase.wal.DefaultWALProvider.createWriter(DefaultWALProvider.java:361)
        ... 4 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Parent directory doesn't exist: /hbase/WALs/fds-hadoop-prod06-mp,60020,1473551672107
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:2466)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2760)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2674)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2559)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:592)
        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:110)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:395)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
        at org.apache.hadoop.ipc.Client.call(Client.java:1466)
        at org.apache.hadoop.ipc.Client.call(Client.java:1403)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at com.sun.proxy.$Proxy23.create(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:295)
        at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
        at com.sun.proxy.$Proxy24.create(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
        at com.sun.proxy.$Proxy25.create(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1867)
        ... 14 more

Appreciate all your help

Proust

On 23 September 2016 at 01:39, Ted Yu wrote:

> For tuning G1GC, see the head of https://blogs.apache.org/hbase/
>
> FYI
>
> On Thu, Sep 22, 2016 at 10:30 AM, Vladimir Rodionov <vladrodionov@gmail.com> wrote:
>
> > Your RS was declared dead because of a long GC.
> >
> > What you can do:
> >
> > 1. Tweak CMS config: -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly.
> > Plus increase heap size accordingly to accommodate decreasing of a working
> > set size (now CMS starts when 50% of heap is occupied).
> > If you see a lot of "promotion failure" messages in GC log - this may help you.
> >
> > 2. Migrate to 1.8 Java and switch to G1GC. Search "tuning G1GC for HBase"
> > on the Internet
> >
> > 3. Analyze what has triggered long GC in RS. Usually it is some long
> > running M/R jobs or data import. Consider other approaches which are not so
> > intrusive (ex. bulk load of data)
> >
> > -Vlad
> >
> > On Thu, Sep 22, 2016 at 10:20 AM, Ted Yu wrote:
> >
> > > bq. the MutateCount metrics didn't show there is a peak time before and
> > > after the problem on that table
> > >
> > > In the graph, there was a purple line with spikes - I assume this was not
> > > for the table with max MemStoreSize.
> > > The legend in the graph is hard to read.
> > > If the sizes of the Puts vary, that may explain what you observed w.r.t.
> > > the two graphs.
> > >
> > > For question #2, were there region splits corresponding to the increase ?
> > > Please check master log. You can pastebin master log if needed.
> > >
> > > For #4, if JMX doesn't respond, one option is to check region server log -
> > > this is workaround.
> > >
> > > On Thu, Sep 22, 2016 at 9:39 AM, Proust (Proust/Feng Guizhou) [Saipan] <pfeng@coupang.com> wrote:
> > >
> > > > Hi, HBase Users
> > > >
> > > > I encounter a RegionServer Dead case, and want to identify which HBase
> > > > Table actually cause the problem
> > > >
> > > > Based on HBase JMX Restful Service, what I have on hand is:
> > > > 1. Table level Grafana monitoring for Put, Get, Scan, MemStoreSize
> > > > 2. RegionServer level Grafana monitoring for GC and RegionCount
> > > > 3. RegionServer logs
> > > >
> > > > *HRegionServer java process information*
> > > > /usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p -Djava.net.preferIPv4Stack=true -Xms34359738368 -Xmx34359738368 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/pang/log/hbase/gc-hbase.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh -Dhbase.log.dir=/pang/log/hbase -Dhbase.log.file=hbase-cmf-hbase-REGIONSERVER-fds-hadoop-prod06-mp.log.out -Dhbase.home.dir=/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hbase -Dhbase.id.str= -Dhbase.root.logger=WARN,RFA -Djava.library.path=/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hadoop/lib/native:/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hbase/lib/native/Linux-amd64-64 -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.regionserver.HRegionServer start
> > > >
> > > > *RegionServer logs show a GC pause*
> > > > 2016-09-21 11:46:19,263 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-45016343-10.20.30.13-1446197023889:blk_1141417801_67682582
> > > > java.io.EOFException: Premature EOF: no length prefix available
> > > >         at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2241)
> > > >         at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
> > > >         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:971)
> > > > 2016-09-21 11:46:19,274 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-45016343-10.20.30.13-1446197023889:blk_1141417801_67682582 in pipeline DatanodeInfoWithStorage[10.10.9.15:50010,DS-4209cc16-e676-4226-be0b-1328075bbdd7,DISK], DatanodeInfoWithStorage[10.10.9.19:50010,DS-c06678d6-3f0d-40a2-82e2-62387e26af7b,DISK], DatanodeInfoWithStorage[10.10.9.21:50010,DS-7bf90b1e-7862-4cc1-a121-31768021a3aa,DISK]: bad datanode DatanodeInfoWithStorage[10.10.9.15:50010,DS-4209cc16-e676-4226-be0b-1328075bbdd7,DISK]
> > > > 2016-09-21 11:46:19,266 WARN org.apache.hadoop.hbase.util.JvmPauseMonitor:
> > > > *Detected pause in JVM or host machine (eg GC): pause of approximately 152322ms
> > > > GC pool 'ParNew' had collection(s): count=1 time=465ms
> > > > GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=152238ms*
> > > > 2016-09-21 11:46:19,263 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 153423ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > 2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 156121ms for sessionid 0x255c4a5f76332f6, closing socket connection and attempting reconnect
> > > > 2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 153414ms for sessionid 0x355c4a5f3c2469f, closing socket connection and attempting reconnect
> > > > 2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 156121ms for sessionid 0x255c4a5f763332c, closing socket connection and attempting reconnect
> > > >
> > > > *RegionServer logs show server dead:*
> > > > 2016-09-21 11:46:19,485 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server fds-hadoop-prod06-mp,60020,1473551672107: *org.apache.hadoop.hbase.YouAreDeadException*: Server REPORT rejected; currently processing fds-hadoop-prod06-mp,60020,1473551672107 as *dead server*
> > > >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
> > > >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
> > > >         at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
> > > >         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
> > > >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
> > > >         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > > >         at java.lang.Thread.run(Thread.java:745)
> > > >
> > > > org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead server
> > > >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
> > > >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
> > > >         at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
> > > >         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
> > > >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
> > > >         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > > >         at java.lang.Thread.run(Thread.java:745)
> > > >
> > > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > > >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > > >         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > > >         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
> > > >         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
> > > >         at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:328)
> > > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1092)
> > > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:900)
> > > >         at java.lang.Thread.run(Thread.java:745)
> > > > Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead server
> > > >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
> > > >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
> > > >         at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
> > > >         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
> > > >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
> > > >         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > > >         at java.lang.Thread.run(Thread.java:745)
> > > >
> > > >         at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1219)
> > > >         at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
> > > >         at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
> > > >         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:8289)
> > > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1090)
> > > >         ... 2 more
> > > > 2016-09-21 11:46:19,505 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, org.apache.hadoop.hbase.regionserver.LocalIndexMerger, org.apache.hadoop.hbase.regionserver.LocalIndexSplitter, org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.ScanRegionObserver, org.apache.phoenix.hbase.index.Indexer, org.apache.phoenix.coprocessor.SequenceRegionObserver]
> > > > 2016-09-21 11:46:19,498 FATAL org.apache.hadoop.hbase.regionserver.LogRoller: Aborting
> > > > java.io.IOException: cannot get log writer
> > > >
> > > > *Error level RegionServer logs when the problem happened*
> > > > 2016-09-21 11:46:19,347 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > 2016-09-21 11:46:19,348 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > 2016-09-21 11:46:19,350 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > 2016-09-21 11:46:19,349 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > 2016-09-21 11:46:19,385 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > 2016-09-21 11:46:19,351 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > 2016-09-21 11:46:19,351 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > 2016-09-21 11:46:19,388 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > 2016-09-21 11:46:19,388 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > 2016-09-21 11:46:19,384 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > 2016-09-21 11:46:20,048 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 692752
> > > > 2016-09-21 11:46:20,103 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 1831544
> > > > 2016-09-21 11:46:20,111 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 44967984
> > > > 2016-09-21 11:46:20,131 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 2334208
> > > > 2016-09-21 11:46:20,308 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 4393560
> > > > 2016-09-21 11:46:20,334 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 2905432
> > > > 2016-09-21 11:46:20,373 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 2310808
> > > > 2016-09-21 11:46:20,402 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 1112648
> > > > 2016-09-21 11:46:20,422 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 6578048
> > > > 2016-09-21 11:46:20,437 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 59242936
> > > > 2016-09-21 11:46:20,608 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 4482224
> > > > 2016-09-21 11:46:20,624 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 20144
> > > > 2016-09-21 11:46:20,652 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 23568
> > > > 2016-09-21 11:46:20,687 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 14395336
> > > > 2016-09-21 11:46:20,706 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 14025824
> > > >
> > > > Attached are some metrics screenshots captured when the problem happened.
> > > > All the metrics have gaps because the RegionServer was not able to handle /jmx requests at all
> > > >
> > > > 1. One screenshot for MutateCount (Put Count) at table level
> > > > http://img0.ph.126.net/coIWlwP2w44jnl1kU5CbcA==/6631704382933867248.png
> > > > 2. One screenshot for RegionServer level GC Time
> > > > http://img0.ph.126.net/ouciVyCvpqEkMPV3phnx5A==/6631661501980397103.png
> > > > 3. One screenshot for Max Region MemStoreSize at table level
> > > > http://img0.ph.126.net/rEQhCk-VWx3HGbcZCM51iA==/6631798940933860065.png
> > > > 4. One screenshot for RegionCount for each RegionServer
> > > > http://img0.ph.126.net/JKPA4-7NJEkFc3_TxOv63A==/6631679094166431135.png
> > > >
> > > > By the way, all the legends in the screenshots are ordered by max value, so the metric series with the max value shows last
> > > >
> > > > With all this information, what I get now is:
> > > > 1. A lot of tables have a sudden Put increase after the RegionServer came back to service; my understanding is the data source streaming was not able to process all the data when the problem happened, so the data accumulated about two times at a later time; those tables seem not to be the root cause
> > > > 2. From the GC metrics, it is pretty easy to identify which region server had the problem, and all the logs were captured from the problem server
> > > > 3. All the region servers had a sudden region count increase and decrease except the dead one
> > > > 4. From the Max MemStoreSize metric screenshot, I prefer to think the table with max MemStoreSize is the root cause, but the MutateCount metrics didn't show a peak time before and after the problem on that table, so I am not confident about this judgement
> > > >
> > > > I need help on
> > > > 1. Identifying which table caused the RegionServer death
> > > > 2. Why RegionCount had a sudden increase and decrease for all the servers that did not die
> > > > 3. Any other metrics missed but important for debugging hotspotting cases
> > > > 4. How I can get metrics when the RegionServer is not able to handle the JMX Restful service
> > > >
> > > > Thanks a lot in advance and please let me know if you need any other information

--
地址: 上海市浦东新区金科路2889弄长泰广场C座7楼
Address: 7th floor, No#3, Jinke road 2889, Pudong district, Shanghai, China.
Mobile: +86 13621672634