hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Duo Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
Date Wed, 14 Mar 2018 03:13:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398029#comment-16398029
] 

Duo Zhang commented on HBASE-20152:
-----------------------------------

{quote}
What we have now is sort of how you describe it.... where we don't do anything until meta
is deployed. We use the PE to run our RMP. Procedures know how to assign, etc.
{quote}

I think once the ProcedureExecutor is started, we will start to schedule procedures if there
are any unfinished ones before? So here, before meta is deployed, we have already started
to execute procedures...

> [AMv2] DisableTableProcedure versus ServerCrashProcedure
> --------------------------------------------------------
>
>                 Key: HBASE-20152
>                 URL: https://issues.apache.org/jira/browse/HBASE-20152
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>
> Seeing a small spate of issues where disabled tables/regions are being assigned. Usually
they happen when a DisableTableProcedure is running concurrent with a ServerCrashProcedure.
See below. See associated HBASE-20131. This is umbrella issue for fixing.
> h3. Deadlock
> From HBASE-20137, 'TestRSGroups is Flakey', https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325
> {code}
>  * SCP is running because a server was aborted in test.
>  * SCP starts AssignProcedure of region X from crashed server.
>  * DisableTable Procedure runs because test has finished and we're doing table delete.
Queues 
>  * UnassignProcedure for region X.
>  * Disable Unassign gets Lock on region X first.
>  * SCP AssignProcedure tries to get lock, waits on lock.
>  * DisableTable Procedure UnassignProcedure RPC fails because server is down (Thats why
the SCP).
>  * Tries to expire the server it failed the RPC against. Fails (currently being SCP'd).
>  * DisableTable Procedure Unassign is suspended. It is a suspend with lock on region
X held
>  * SCP can't run because lock on X is held
>  * Test timesout.
> {code}
> Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 is the disable
table:
> {code}
> 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] procedure.ServerCrashProcedure(225): Done
splitting WALs pid=308, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure server=1cfd208ff882,40584,1520249102524,
splitWal=true, meta=false
> 2018-03-05 11:29:21,300 INFO  [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo
> 2018-03-05 11:29:21,307 INFO  [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): Client=jenkins//172.17.0.2 list
rsgroup
> 2018-03-05 11:29:21,312 INFO  [Time-limited test] client.HBaseAdmin$15(901): Started
disable of Group_ns:testKillRS
> 2018-03-05 11:29:21,313 INFO  [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS
> 2018-03-05 11:29:21,384 INFO  [PEWorker-9] procedure2.ProcedureExecutor(1495): Initialized
subprocedures=[{pid=310, ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}]
> 2018-03-05 11:29:21,534 DEBUG [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
procedure2.ProcedureExecutor(865): Stored pid=309, state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure
table=Group_ns:testKillRS
> 2018-03-05 11:29:21,542 DEBUG [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:21,644 DEBUG [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:21,847 DEBUG [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117}
> 2018-03-05 11:29:22,123 INFO  [PEWorker-5] hbase.MetaTableAccessor(1646): Updated table
Group_ns:testKillRS state to DISABLING in META
> 2018-03-05 11:29:22,148 DEBUG [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:22,345 INFO  [PEWorker-5] procedure2.ProcedureExecutor(1495): Initialized
subprocedures=[{pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524}]
> 2018-03-05 11:29:22,503 INFO  [PEWorker-13] procedure.MasterProcedureScheduler(571):
pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS,
region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524, Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043.
> 2018-03-05 11:29:22,503 INFO  [PEWorker-5] procedure.MasterProcedureScheduler(571): pid=310,
ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=Group_ns:testKillRS,
region=de7534c208a06502537cd95c248b3043, Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043.
> 2018-03-05 11:29:22,504 INFO  [PEWorker-13] assignment.RegionStateStore(189): pid=311
updating hbase:meta row=Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043.,
regionState=CLOSING, regionLocation=1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,504 DEBUG [PEWorker-5] assignment.RegionTransitionProcedure(377):
LOCK_EVENT_WAIT pid=310 serverLocks={}, namespaceLocks={}, tableLocks={}, regionLocks={{de7534c208a06502537cd95c248b3043=exclusiveLockOwner=311,
sharedLockCount=0, waitingProcCount=1}}, peerLocks={}
> 2018-03-05 11:29:22,504 DEBUG [PEWorker-5] procedure2.ProcedureExecutor(1230): LOCK_EVENT_WAIT
pid=310, ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=Group_ns:testKillRS,
region=de7534c208a06502537cd95c248b3043
> 2018-03-05 11:29:22,507 INFO  [PEWorker-13] assignment.RegionTransitionProcedure(213):
Dispatch pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS,
region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING,
location=1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] assignment.RegionTransitionProcedure(187):
Remote call failed pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524;
rit=CLOSING, location=1cfd208ff882,40584,1520249102524; exception=pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524
to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] assignment.UnassignProcedure(276): Expiring
server pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS,
region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING,
location=1cfd208ff882,40584,1520249102524, exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException:
pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS,
region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> {code}
> It disable table procedure unassign and the SCP assign arrive at same time, in same ms
(2018-03-05 11:29:22,503) but the unassign manages to get the lock first.
> h3. Delete of online Regions
> Saw this in nightly failure #452 for branch-2 in TestSplitTransactionOnCluster.org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
> {code}
>  * DisableTableProcedure is queued before SCP.
>  * DisableTableProcedure Unassign fails because can't RPC to crashed server and can't
expire.
>  * Unassign is Stuck in suspend.
>  * SCP runs and cleans up suspended Disable Unassign.
>  * SCP completes which includes assign of Disable Unassign region.
>  * Disable Unassign completes
>  * Disable completes.
>  * A scheduled Drop Table Procedure runs (its end of test).
>  * Succeeds deleting regions that are actually assigned (see above where SCP assigned
region).
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message