hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
Date Thu, 08 Mar 2018 06:52:00 GMT

     [ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-20152:
--------------------------
    Description: 
Seeing a small spate of issues where disabled tables/regions are being assigned. They usually
happen when a DisableTableProcedure is running concurrently with a ServerCrashProcedure (SCP). See
below. See associated HBASE-20131. This is the umbrella issue for fixing them.

h3. Deadlock
From HBASE-20137, 'TestRSGroups is Flakey', https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325

{code}
 * SCP is running because a server was aborted in the test.
 * SCP starts AssignProcedure of region X from the crashed server.
 * DisableTable Procedure runs because the test has finished and we're doing table delete. It queues an UnassignProcedure for region X.
 * Disable Unassign gets the lock on region X first.
 * SCP AssignProcedure tries to get the lock, waits on the lock.
 * DisableTable Procedure UnassignProcedure RPC fails because the server is down (that's why the SCP).
 * It tries to expire the server it failed the RPC against. This fails (the server is currently being SCP'd).
 * DisableTable Procedure Unassign is suspended. It is a suspend with the lock on region X held.
 * SCP can't run because the lock on X is held.
 * Test times out.
{code}
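
The sequence above can be replayed as a toy model to make the stuck state explicit. This is only a sketch: the class, the lock table, and the procedure names are made up for illustration and are not HBase APIs. The releaseLockOnSuspend flag is a hypothetical contrast case, not the fix proposed in this issue.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the sequence above. None of these names are HBase APIs.
final class RegionLockDeadlockSketch {
    // Region name -> name of the procedure that currently holds its lock.
    private final Map<String, String> regionLocks = new HashMap<>();
    final List<String> log = new ArrayList<>();

    // Non-blocking acquire: returns false where a real procedure would wait.
    boolean tryLock(String region, String proc) {
        String owner = regionLocks.putIfAbsent(region, proc);
        boolean acquired = owner == null || owner.equals(proc);
        log.add(proc + (acquired ? " holds lock on " : " waits for lock on ") + region);
        return acquired;
    }

    void release(String region, String proc) {
        regionLocks.remove(region, proc); // removes only if proc is the owner
    }

    // Returns true if the SCP assign is still blocked at the end of the replay.
    static boolean replay(boolean releaseLockOnSuspend) {
        RegionLockDeadlockSketch s = new RegionLockDeadlockSketch();
        s.tryLock("X", "DisableUnassign"); // Disable Unassign gets the lock first
        s.tryLock("X", "ScpAssign");       // SCP assign has to wait
        // Unassign RPC fails (server is down) and server expiry fails
        // (already being SCP'd), so the unassign suspends.
        s.log.add("DisableUnassign suspended");
        if (releaseLockOnSuspend) {
            s.release("X", "DisableUnassign"); // hypothetical, for contrast only
        }
        boolean assignRuns = s.tryLock("X", "ScpAssign");
        return !assignRuns;
    }

    public static void main(String[] args) {
        System.out.println("stuck with lock held across suspend: " + replay(false));
        System.out.println("stuck if lock released on suspend:   " + replay(true));
    }
}
```

In this model the SCP assign only stays blocked when the suspended unassign keeps the region lock, which is the condition described in the list.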

h3. Delete of online Regions
Saw this in nightly failure #452 for branch-2 in TestSplitTransactionOnCluster.org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster

{code}
 * DisableTableProcedure is queued before SCP.
 * DisableTableProcedure Unassign fails because it can't RPC to the crashed server and can't expire it.
 * Unassign is stuck in suspend.
 * SCP runs and cleans up the suspended Disable Unassign.
 * SCP completes, which includes assign of the Disable Unassign region.
 * Disable Unassign completes.
 * Disable completes.
 * A scheduled Drop Table Procedure runs (it's end of test).
 * It succeeds in deleting regions that are actually assigned (see above where SCP assigned the region).
{code}
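
A similar toy replay shows how a region can still be open when the drop runs. Again a sketch under assumptions: the names are illustrative, not HBase APIs, and the model assumes the disable completes without rechecking region state after SCP's assign. The disableRechecksState flag is a hypothetical guard added only for contrast.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the sequence above. None of these names are HBase APIs.
final class DeleteOnlineRegionSketch {
    enum RegionState { OPEN, CLOSED }

    // Returns the regions that were still OPEN when the drop deleted them.
    static List<String> replay(boolean disableRechecksState) {
        Map<String, RegionState> regions = new LinkedHashMap<>();
        regions.put("X", RegionState.OPEN); // region on the server that crashes

        // Disable's unassign cannot reach the crashed server and suspends;
        // SCP later cleans it up, so the unassign itself never closes X.

        // SCP completes, which includes assigning region X again.
        regions.put("X", RegionState.OPEN);

        // Disable Unassign and Disable complete.
        if (disableRechecksState) {
            // Hypothetical guard: close anything that got reassigned meanwhile.
            regions.replaceAll((name, state) -> RegionState.CLOSED);
        }
        boolean tableDisabled = true;

        // Drop runs at end of test and trusts the table-level disabled flag.
        List<String> deletedWhileOpen = new ArrayList<>();
        if (tableDisabled) {
            regions.forEach((name, state) -> {
                if (state == RegionState.OPEN) {
                    deletedWhileOpen.add(name);
                }
            });
            regions.clear();
        }
        return deletedWhileOpen;
    }

    public static void main(String[] args) {
        System.out.println("deleted while open: " + replay(false));
    }
}
```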


> [AMv2] DisableTableProcedure versus ServerCrashProcedure
> --------------------------------------------------------
>
>                 Key: HBASE-20152
>                 URL: https://issues.apache.org/jira/browse/HBASE-20152
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
