hbase-dev mailing list archives

From "Ted Yu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-19710) hbase:namespace table was stuck in transition
Date Fri, 05 Jan 2018 00:09:00 GMT
Ted Yu created HBASE-19710:
------------------------------

             Summary: hbase:namespace table was stuck in transition
                 Key: HBASE-19710
                 URL: https://issues.apache.org/jira/browse/HBASE-19710
             Project: HBase
          Issue Type: Bug
            Reporter: Ted Yu
            Priority: Critical


ITBLL (IntegrationTestBigLinkedList) with chaos monkey failed because the hbase:namespace table got stuck in transition.

From hbase-hbase-master-ctr-e137-1514896590304-3629-01-000006.hwx.site.log, we can see
that the master closed the namespace table on 000009:
{code}
2018-01-04 17:24:35,067 DEBUG [main-EventThread] zookeeper.ZKWatcher: master:20000-0x160c222710c0028, quorum=ctr-e137-1514896590304-3629-01-000011.hwx.site:2181,ctr-e137-1514896590304-3629-01-000014.hwx.site:2181,ctr-e137-1514896590304-3629-01-000009.hwx.site:2181,ctr-e137-1514896590304-3629-01-000006.hwx.site:2181,ctr-e137-1514896590304-3629-01-000003.hwx.site:2181,ctr-e137-1514896590304-3629-01-000007.hwx.site:2181,ctr-e137-1514896590304-3629-01-000013.hwx.site:2181,ctr-e137-1514896590304-3629-01-000002.hwx.site:2181,ctr-e137-1514896590304-3629-01-000012.hwx.site:2181,ctr-e137-1514896590304-3629-01-000008.hwx.site:2181,ctr-e137-1514896590304-3629-01-000010.hwx.site:2181, baseZNode=/hbase-unsecure Received ZooKeeper Event, type=NodeChildrenChanged, state=SyncConnected, path=/hbase-unsecure/rs
2018-01-04 17:24:35,067 INFO  [ProcExecWrkr-5] assignment.RegionStateStore: pid=643 updating hbase:meta row=hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9., regionState=CLOSING, regionLocation=ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086643872
...
2018-01-04 17:24:35,246 INFO  [ProcExecWrkr-12] procedure.MasterProcedureScheduler: pid=647, ppid=642, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace, region=a95ed2d7434a43390fbec73abeeb9fd9 hbase:namespace hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.

2018-01-04 17:25:17,041 DEBUG [ctr-e137-1514896590304-3629-01-000006:20000.masterManager] procedure2.ProcedureExecutor: Loading pid=641, state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure hri=hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9., source=ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086643872, destination=
{code}
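In the master log above, the same region appears both as the full hbase:meta row key (`hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.`) and as the bare encoded name (`region=a95ed2d7434a43390fbec73abeeb9fd9`) in the AssignProcedure line. The region server's `getRegionByEncodedName` lookup also uses the encoded name. Below is a minimal sketch of splitting such a name into its parts. It assumes the `<table>,<startKey>,<regionId>.<encodedName>.` layout seen in these logs. `RegionNameParts` and `parse` are hypothetical names, and this is plain string handling for reading logs, not HBase's `RegionInfo` API.

```java
// Hypothetical helper for reading these logs: splits an HBase-style region name of
// the form "<table>,<startKey>,<regionId>.<encodedName>." into its parts.
// Plain string handling only; not HBase's RegionInfo API.
final class RegionNameParts {
    final String table;
    final String startKey;
    final long regionId;
    final String encodedName;

    private RegionNameParts(String table, String startKey, long regionId, String encodedName) {
        this.table = table;
        this.startKey = startKey;
        this.regionId = regionId;
        this.encodedName = encodedName;
    }

    static RegionNameParts parse(String regionName) {
        // Table names cannot contain ',', so the first comma ends the table name.
        int firstComma = regionName.indexOf(',');
        // Start keys may contain ',', so the region id starts after the last comma.
        int lastComma = regionName.lastIndexOf(',');
        if (firstComma < 0 || lastComma == firstComma) {
            throw new IllegalArgumentException("not a region name: " + regionName);
        }
        // e.g. "1515085217343.a95ed2d7434a43390fbec73abeeb9fd9."
        String tail = regionName.substring(lastComma + 1);
        int dot = tail.indexOf('.');
        if (dot <= 0 || dot == tail.length() - 1 || !tail.endsWith(".")) {
            throw new IllegalArgumentException("no encoded-name suffix: " + regionName);
        }
        return new RegionNameParts(
                regionName.substring(0, firstComma),
                regionName.substring(firstComma + 1, lastComma),
                Long.parseLong(tail.substring(0, dot)),
                tail.substring(dot + 1, tail.length() - 1)); // drop the trailing '.'
    }

    public static void main(String[] args) {
        RegionNameParts p = parse("hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.");
        System.out.println(p.table);              // hbase:namespace
        System.out.println(p.startKey.isEmpty()); // true: empty start key, first region of the table
        System.out.println(p.regionId);           // 1515085217343
        System.out.println(p.encodedName);        // a95ed2d7434a43390fbec73abeeb9fd9
    }
}
```

Matching on the encoded name is a quick way to connect the AssignProcedure line and the region server's exception back to the hbase:meta row.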

For the move operation, from the ctr-e137-1514896590304-3629-01-000009.hwx.site log:
{code}
2018-01-04 17:24:34,855 DEBUG [RS_CLOSE_REGION-ctr-e137-1514896590304-3629-01-000009:16020-0] coprocessor.CoprocessorHost: Stop coprocessor org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint
2018-01-04 17:24:34,855 INFO  [RS_CLOSE_REGION-ctr-e137-1514896590304-3629-01-000009:16020-0] regionserver.HRegion: Closed hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.
2018-01-04 17:24:34,856 DEBUG [RS_CLOSE_REGION-ctr-e137-1514896590304-3629-01-000009:16020-0] handler.CloseRegionHandler: Closed hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.
...
2018-01-04 17:25:47,607 DEBUG [RpcServer.priority.FPBQ.Fifo.handler=18,queue=0,port=16020] ipc.RpcServer: callId: 16 service: ClientService methodName: Get size: 103 connection: 172.27.13.80:36738 deadline: 1515086837568
org.apache.hadoop.hbase.NotServingRegionException: hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9. is not online on ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086719163
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3312)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3289)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1354)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2360)
        at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:403)
{code}
We can see that the region server on 000009, by then running as a new instance (start code 1515086719163 instead of 1515086643872), was not serving the region.

After that, the masters kept assuming that the namespace table was on 000009, which led to prolonged downtime.
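The master recorded the region at `ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086643872`, but the NotServingRegionException came from `...,16020,1515086719163`. Both names have the same host and port and differ only in the third field. That field is the region server's start code, meaning its start time in epoch milliseconds, so different values mean two different processes: the server restarted in between. Below is a minimal sketch of that comparison. It assumes the `host,port,startcode` layout seen in these logs. `ServerNameParts`, `sameAddress`, and `sameInstance` are hypothetical names chosen for illustration. Real code would use HBase's `org.apache.hadoop.hbase.ServerName` class instead.

```java
import java.time.Instant;

// Hypothetical helper: parses "host,port,startcode" server names as they appear in
// these logs. The start code is the region server's start time in epoch millis, so
// the same host and port with a different start code means a restarted process.
final class ServerNameParts {
    final String host;
    final int port;
    final long startCode;

    ServerNameParts(String host, int port, long startCode) {
        this.host = host;
        this.port = port;
        this.startCode = startCode;
    }

    static ServerNameParts parse(String serverName) {
        String[] f = serverName.split(",");
        if (f.length != 3) {
            throw new IllegalArgumentException("expected host,port,startcode: " + serverName);
        }
        return new ServerNameParts(f[0], Integer.parseInt(f[1]), Long.parseLong(f[2]));
    }

    // Same machine and port, regardless of which process is running there.
    boolean sameAddress(ServerNameParts other) {
        return host.equals(other.host) && port == other.port;
    }

    // Same machine, port, and process (start code).
    boolean sameInstance(ServerNameParts other) {
        return sameAddress(other) && startCode == other.startCode;
    }

    Instant startedAt() {
        return Instant.ofEpochMilli(startCode);
    }

    public static void main(String[] args) {
        String host = "ctr-e137-1514896590304-3629-01-000009.hwx.site";
        ServerNameParts recorded = parse(host + ",16020,1515086643872"); // location in hbase:meta
        ServerNameParts serving = parse(host + ",16020,1515086719163");  // instance that threw NSRE
        System.out.println(recorded.sameAddress(serving));  // true
        System.out.println(recorded.sameInstance(serving)); // false
        System.out.println(recorded.startedAt());           // 2018-01-04T17:24:03.872Z
        System.out.println(serving.startedAt());            // 2018-01-04T17:25:19.163Z
    }
}
```

The decoded instants are in UTC. The report does not state the log time zone. If the logs are also in UTC, the new instance started at 17:25:19. That is after the region was closed at 17:24:34 and before the failed Get at 17:25:47.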



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
