hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (HBASE-18551) [AMv2] UnassignProcedure and crashed regionservers
Date Thu, 10 Aug 2017 21:33:00 GMT

     [ https://issues.apache.org/jira/browse/HBASE-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

stack reassigned HBASE-18551:

    Assignee: stack

> [AMv2] UnassignProcedure and crashed regionservers
> --------------------------------------------------
>                 Key: HBASE-18551
>                 URL: https://issues.apache.org/jira/browse/HBASE-18551
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>            Reporter: stack
>            Assignee: stack
>         Attachments: HBASE-18551.master.001.patch
> This has been [~uagashe]'s and my obsession over the last few days: what should an UnassignProcedure
do when it dispatches a CLOSE but the CLOSE fails because of a ConnectException or SocketTimeout?
> + We used to let UnassignProcedure continue presuming the Region would be closed since
the server is dead. BUT, if the unassign was part of a MoveProcedure, the unassign would proceed
and the Move would then run WITHOUT first splitting logs. Bad.
> + So, we made it so UnassignProcedure failed and let the upper layers take care of the failure.
See HBASE-18491, which enabled this behavior. BUT, we have since figured that even if the UP (UnassignProcedure)
completes as a failure, because it gives up the Region lock on completion, another procedure
-- say an AssignProcedure -- could cut in before the ServerCrashProcedure had finished, and
again there could be data loss.
> + Now we are thinking the UP should hold on to the Region lock until we are signalled
by a ServerCrashProcedure (SCP); only then let go of the region. The UP has context that is hard
to pass to another procedure. Waiting on an SCP has the UP living on for what could be a good amount of
time. It might be OK if we can suspend the procedure.
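The third option above can be sketched roughly as follows. This is a toy model under stated assumptions, not HBase code: `UnassignHoldSketch`, `RegionLock`, `CloseDispatcher`, and the `crashHandled` future are all hypothetical names standing in for the region lock, the CLOSE dispatch, and the signal from a ServerCrashProcedure. It only illustrates the idea of keeping the lock until crash handling signals.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

/** Illustrative only: none of these types are HBase classes. */
final class UnassignHoldSketch {

    /** Hypothetical stand-in for the exclusive per-region lock (one permit). */
    static final class RegionLock {
        private final Semaphore permit = new Semaphore(1);
        boolean tryAcquire() { return permit.tryAcquire(); }
        void release() { permit.release(); }
    }

    /** Hypothetical stand-in for the RPC that asks a regionserver to CLOSE a region. */
    interface CloseDispatcher {
        void dispatchClose(String region) throws IOException;
    }

    /**
     * Runs a toy unassign. The returned future completes once the region lock
     * has been released. If the CLOSE fails with ConnectException or
     * SocketTimeoutException, the lock is kept until crashHandled completes.
     */
    static CompletableFuture<Void> unassign(String region, RegionLock lock,
            CloseDispatcher rpc, CompletableFuture<Void> crashHandled) {
        if (!lock.tryAcquire()) {
            throw new IllegalStateException("region is locked: " + region);
        }
        try {
            rpc.dispatchClose(region);
        } catch (ConnectException | SocketTimeoutException e) {
            // Server is presumed dead: hold the lock so nothing else (say, an
            // assign) can cut in before crash handling has finished.
            return crashHandled.thenRun(lock::release);
        } catch (IOException e) {
            // Any other failure: release the lock and report the error.
            lock.release();
            CompletableFuture<Void> failed = new CompletableFuture<>();
            failed.completeExceptionally(e);
            return failed;
        }
        // CLOSE succeeded: nothing to wait for.
        lock.release();
        return CompletableFuture.completedFuture(null);
    }

    public static void main(String[] args) {
        RegionLock lock = new RegionLock();
        CompletableFuture<Void> crashHandled = new CompletableFuture<>();
        CloseDispatcher deadServer = r -> { throw new ConnectException("connection refused"); };

        CompletableFuture<Void> done = unassign("region-1", lock, deadServer, crashHandled);
        System.out.println("assign can lock before signal: " + lock.tryAcquire());
        crashHandled.complete(null); // crash handling finished; lock is released
        System.out.println("unassign finished: " + done.isDone());
        System.out.println("assign can lock after signal: " + lock.tryAcquire());
    }
}
```

In this sketch the waiting is done by a future callback rather than a blocked thread, which loosely mirrors the "suspend the procedure" idea; how a real procedure would be suspended and woken is not covered here.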
> There is a good sample scenario that came up while working on the no-regions-on-master issue, HBASE-18511.
When meta is not on master, TestSplitTransactionOnCluster fails. It fails because, though
the test completes, the tests commonly kill a RegionServer. The teardown for the test runs
before we've noticed the aborted RS. So the disable of the table in the teardown, preparatory
to our deleting the test table as part of cleanup, goes to unassign regions, but the unassign
fails against the aborted server.
> Good stuff.

This message was sent by Atlassian JIRA
