hbase-issues mailing list archives

From "Umesh Agashe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-18261) [AMv2] Create new RecoverMetaProcedure and use it from ServerCrashProcedure and HMaster.finishActiveMasterInitialization()
Date Thu, 13 Jul 2017 21:55:02 GMT

    [ https://issues.apache.org/jira/browse/HBASE-18261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086469#comment-16086469
] 

Umesh Agashe commented on HBASE-18261:
--------------------------------------

Hi [~stack], [~yangzhe1991]:

FWICS here is the root cause:

The unit test exercises ServerCrashProcedure when the RS carrying the meta region crashes. It
also simulates a master crash after each step of the procedure.

Initially all RSs are at the same version, i.e. 3.0.0-SNAPSHOT. HMaster.getRegionServerVersion()
returns version 0.0.0 for the dead RS (the one carrying meta). This makes
AssignmentManager.getExcludedServersForSystemTable() return a non-empty list, which triggers the
logic in AssignmentManager.checkIfShouldMoveSystemRegionAsync(). That in turn submits a
MoveRegionProcedure to move the meta region from the RS with version 0.0.0 to one of the other
RSs with the latest version.

As commented before, this causes a race condition between the scan and MoveRegionProcedure.

AssignmentManager.getExcludedServersForSystemTable() uses master.getServerManager().getOnlineServersList()
to get the list of online servers only. But on further scrutiny of the code and logs, I found
that a server can be online and dead at the same time!

IMO,
* Currently meta is re/assigned from ServerCrashProcedure, during master initialization from
MasterMetaBootstrap, and then again in checkIfShouldMoveSystemRegionAsync().
* That means meta re/assignment may be attempted up to 3 times under certain conditions.
* I am working on HBASE-18261 to have the meta recovery/assignment logic in one place.
* I think we can pull the changes for assigning meta to the RS with the highest version number
in there.
* As a result, the RS with the highest version number will be considered for meta region
assignment when:
# The RS carrying the meta region crashes
# The master starts up

Along with the above changes, we obviously need to fix ServerManager.isServerOnline() and
ServerManager.isServerDead() returning true at the same time. This could be a result of the test
code simulating a crash, but the class itself should not allow this case (IMHO).
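
One way to picture the invariant described above is a tracker that moves a server between the two states atomically, so it can never be reported as online and dead at once. This is only a minimal sketch: ServerStateTracker and its methods are hypothetical names, servers are plain strings, and it does not reflect how ServerManager is actually structured. Letting the same name come back online is also a simplification.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the HBase ServerManager.
class ServerStateTracker {
  private final Set<String> online = new HashSet<>();
  private final Set<String> dead = new HashSet<>();

  // Both sets are updated under the same lock, so a server is in at most one of them.
  synchronized void markOnline(String server) {
    dead.remove(server);
    online.add(server);
  }

  synchronized void markDead(String server) {
    online.remove(server);
    dead.add(server);
  }

  synchronized boolean isOnline(String server) {
    return online.contains(server);
  }

  synchronized boolean isDead(String server) {
    return dead.contains(server);
  }

  public static void main(String[] args) {
    ServerStateTracker tracker = new ServerStateTracker();
    tracker.markOnline("rs1");
    tracker.markDead("rs1"); // simulated crash
    System.out.println("online=" + tracker.isOnline("rs1") + " dead=" + tracker.isDead("rs1"));
  }
}
```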

I have the following fix ready (and tested), which fixes the test, but I don't consider it
a long-term fix.
{code}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
index 046612a..1a2d53b 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
@@ -1760,6 +1760,7 @@ public class AssignmentManager implements ServerListener {
   public List<ServerName> getExcludedServersForSystemTable() {
     List<Pair<ServerName, String>> serverList = master.getServerManager().getOnlineServersList()
         .stream()
+        .filter((s)->!master.getServerManager().isServerDead(s))
         .map((s)->new Pair<>(s, master.getRegionServerVersion(s)))
         .collect(Collectors.toList());
     if (serverList.isEmpty()) {
{code}
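
As a rough illustration of why the extra filter changes the outcome, the sketch below models the exclusion logic with plain strings and maps. It is not the HBase code: the class ExcludedServersSketch, its methods, the server names rs1 to rs3, and the simplified version comparison are all assumptions made for this example. The only behaviour taken from the analysis above is that the dead RS is still in the online list and reports version 0.0.0.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of the exclusion logic; not the HBase classes.
class ExcludedServersSketch {

  // Compares only the numeric "major.minor.patch" part; a suffix such as "-SNAPSHOT" is ignored.
  static int compareVersion(String a, String b) {
    String[] x = a.split("-")[0].split("\\.");
    String[] y = b.split("-")[0].split("\\.");
    for (int i = 0; i < Math.max(x.length, y.length); i++) {
      int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
      int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
      if (xi != yi) {
        return Integer.compare(xi, yi);
      }
    }
    return 0;
  }

  // Returns every considered server whose version is below the highest version seen.
  static List<String> excludedServers(Map<String, String> versionByOnlineServer,
      Set<String> deadServers, boolean filterDead) {
    Map<String, String> considered = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : versionByOnlineServer.entrySet()) {
      if (filterDead && deadServers.contains(e.getKey())) {
        continue; // mirrors the added .filter(...) step
      }
      considered.put(e.getKey(), e.getValue());
    }
    if (considered.isEmpty()) {
      return Collections.emptyList();
    }
    String highest = Collections.max(considered.values(), ExcludedServersSketch::compareVersion);
    List<String> excluded = new ArrayList<>();
    for (Map.Entry<String, String> e : considered.entrySet()) {
      if (compareVersion(e.getValue(), highest) < 0) {
        excluded.add(e.getKey());
      }
    }
    return excluded;
  }

  public static void main(String[] args) {
    Map<String, String> versions = new LinkedHashMap<>();
    versions.put("rs1", "0.0.0"); // dead RS that carried meta, still listed as online
    versions.put("rs2", "3.0.0-SNAPSHOT");
    versions.put("rs3", "3.0.0-SNAPSHOT");
    Set<String> dead = Set.of("rs1");

    System.out.println(excludedServers(versions, dead, false)); // prints [rs1]
    System.out.println(excludedServers(versions, dead, true));  // prints []
  }
}
```

Without the filter, the dead server ends up in the excluded list, which is the condition that leads to a move being submitted. With the filter, the list is empty because all remaining servers report the same version.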

[~stack], as you suggested, we can disable the test for now. Once we agree on a fix, we
can re-enable it. Let me know your thoughts. Thanks!

> [AMv2] Create new RecoverMetaProcedure and use it from ServerCrashProcedure and HMaster.finishActiveMasterInitialization()
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-18261
>                 URL: https://issues.apache.org/jira/browse/HBASE-18261
>             Project: HBase
>          Issue Type: Improvement
>          Components: amv2
>    Affects Versions: 2.0.0-alpha-1
>            Reporter: Umesh Agashe
>            Assignee: Umesh Agashe
>             Fix For: 2.0.0-alpha-2
>
>         Attachments: HBASE-18261.master.001.patch
>
>
> When unit test hbase.master.procedure.TestServerCrashProcedure#testRecoveryAndDoubleExecutionOnRsWithMeta()
> is enabled and run several times, it fails intermittently. The cause is that meta recovery is
> done in two different places:
> * ServerCrashProcedure.processMeta()
> * HMaster.finishActiveMasterInitialization()
> and it is not coordinated.
> When HMaster.finishActiveMasterInitialization() gets to submit splitMetaLog() first, then
> while it is running, the call from ServerCrashProcedure.processMeta() fails, causing the step
> to be retried in a loop.
> When ServerCrashProcedure.processMeta() submits splitMetaLog after the splitMetaLog from
> HMaster.finishActiveMasterInitialization() has finished, success is returned without doing
> any work.
> But if ServerCrashProcedure.processMeta() submits the splitMetaLog request and, while it is
> in progress, HMaster.finishActiveMasterInitialization() submits it too, the test fails with
> an exception.
> [~stack] and I discussed a possible solution:
> Create RecoverMetaProcedure and call it where required. The procedure framework provides
> mutual exclusion and requires idempotence, which should fix the problem.
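
The two properties named in the quoted description, mutual exclusion and idempotence, can be sketched in a few lines. This is a hypothetical illustration only: MetaRecoverySketch and its methods are invented for the example, a counter stands in for the real log-splitting work, and nothing here reflects the actual procedure framework API.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of mutual exclusion plus idempotence; not the HBase procedure framework.
class MetaRecoverySketch {
  private boolean metaLogSplit = false;
  private final AtomicInteger splitCount = new AtomicInteger();

  // synchronized gives mutual exclusion; the flag check makes repeated calls harmless.
  synchronized boolean recoverMeta() {
    if (metaLogSplit) {
      return false; // already recovered, nothing to do
    }
    splitCount.incrementAndGet(); // stand-in for splitting the meta log
    metaLogSplit = true;
    return true;
  }

  int getSplitCount() {
    return splitCount.get();
  }

  public static void main(String[] args) throws InterruptedException {
    MetaRecoverySketch recovery = new MetaRecoverySketch();
    // Two callers, standing in for the crash handler and master initialization.
    Thread crashHandler = new Thread(recovery::recoverMeta);
    Thread masterInit = new Thread(recovery::recoverMeta);
    crashHandler.start();
    masterInit.start();
    crashHandler.join();
    masterInit.join();
    System.out.println("split performed " + recovery.getSplitCount() + " time(s)");
  }
}
```

Whichever caller arrives first does the work, and the other returns without repeating it or failing.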



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
