hbase-issues mailing list archives

From "Umesh Agashe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-18366) Fix flaky test hbase.master.procedure.TestServerCrashProcedure#testRecoveryAndDoubleExecutionOnRsWithMeta
Date Thu, 13 Jul 2017 21:56:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086471#comment-16086471 ]

Umesh Agashe commented on HBASE-18366:
--------------------------------------

Hi [~stack], [~yangzhe1991]:

FWICS here is the root cause:

The UT tests ServerCrashProcedure when the RS carrying the meta region crashes. It also simulates
a master crash after executing each step in the procedure.

Initially all RSs are at the same version, i.e. 3.0.0-SNAPSHOT. HMaster.getRegionServerVersion()
returns version 0.0.0 for the dead RS (the one carrying meta). This makes AssignmentManager.getExcludedServersForSystemTable()
return a non-empty list, which triggers the logic in AssignmentManager.checkIfShouldMoveSystemRegionAsync().
That in turn submits a MoveRegionProcedure to move the meta region from the RS with version
0.0.0 to one of the other RSs with the latest version.
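
To make the mechanism above concrete, here is a minimal, self-contained sketch of how a single 0.0.0 entry can produce a non-empty excluded list. It is not the actual AssignmentManager code: the class ExcludedServersSketch, its methods, the server names and the version-comparison rule are all hypothetical and only assume that servers below the highest reported version get excluded.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the exclusion logic; NOT the real AssignmentManager.
class ExcludedServersSketch {

  // Compare dotted version strings numerically, ignoring a "-SNAPSHOT" style suffix.
  static int compareVersions(String a, String b) {
    String[] x = a.split("-")[0].split("\\.");
    String[] y = b.split("-")[0].split("\\.");
    for (int i = 0; i < Math.max(x.length, y.length); i++) {
      int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
      int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
      if (xi != yi) {
        return Integer.compare(xi, yi);
      }
    }
    return 0;
  }

  // Assumption: every server below the highest version seen is excluded.
  static List<String> excludedServers(Map<String, String> versionByServer) {
    String highest = "0.0.0";
    for (String v : versionByServer.values()) {
      if (compareVersions(v, highest) > 0) {
        highest = v;
      }
    }
    List<String> excluded = new ArrayList<>();
    for (Map.Entry<String, String> e : versionByServer.entrySet()) {
      if (compareVersions(e.getValue(), highest) < 0) {
        excluded.add(e.getKey());
      }
    }
    return excluded;
  }

  public static void main(String[] args) {
    Map<String, String> versions = new LinkedHashMap<>();
    versions.put("rs1", "3.0.0-SNAPSHOT");
    versions.put("rs2", "3.0.0-SNAPSHOT");
    versions.put("rs3-dead", "0.0.0"); // dead RS that carried meta, reported as 0.0.0
    System.out.println(excludedServers(versions)); // prints [rs3-dead]
  }
}
```

With only the two live servers in the map the excluded list is empty, so in this sketch no move would be triggered.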

As commented before, this causes a race condition between the scan and MoveRegionProcedure.

AssignmentManager.getExcludedServersForSystemTable() uses master.getServerManager().getOnlineServersList()
to get the list of online servers only. But on further scrutiny of the code and logs I found that
a server can be online and dead at the same time!

IMO, 
* Currently meta is re/assigned from ServerCrashProcedure, from MasterMetaBootstrap during master
initialization, and then again in checkIfShouldMoveSystemRegionAsync().
* That means meta re/assignment may be attempted up to 3 times under certain conditions.
* I am working on HBASE-18261 to have the meta recovery/assignment logic in one place.
* I think we can pull the changes for assigning meta to the RS with the highest version number in there.
* As a result, the RS with the highest version number will be considered for meta region assignment
when:
# the RS carrying the meta region crashes
# the master starts up

Along with the above changes, we obviously need to fix ServerManager.isServerOnline() and ServerManager.isServerDead()
returning true at the same time. This could be a result of the test code simulating the crash, but the
class itself should not allow this case (IMHO).
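
One way to picture the invariant asked for above is a tracker that updates both sets inside the same critical section, so a server can never be reported online and dead at once. This is only a sketch under assumptions: ServerStateTrackerSketch, markOnline() and markDead() are hypothetical names, servers are plain strings, and the real ServerManager keeps its state differently.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical tracker, NOT the real ServerManager: a server is in at most one set.
class ServerStateTrackerSketch {
  private final Set<String> online = new HashSet<>();
  private final Set<String> dead = new HashSet<>();

  // Registering a server as online clears any earlier dead marking (a simplification).
  synchronized void markOnline(String server) {
    dead.remove(server);
    online.add(server);
  }

  // Marking a server dead removes it from the online set under the same lock.
  synchronized void markDead(String server) {
    online.remove(server);
    dead.add(server);
  }

  synchronized boolean isServerOnline(String server) {
    return online.contains(server);
  }

  synchronized boolean isServerDead(String server) {
    return dead.contains(server);
  }

  public static void main(String[] args) {
    ServerStateTrackerSketch tracker = new ServerStateTrackerSketch();
    tracker.markOnline("rs1");
    tracker.markDead("rs1");
    // prints "false true"
    System.out.println(tracker.isServerOnline("rs1") + " " + tracker.isServerDead("rs1"));
  }
}
```

With both transitions going through the same lock, test code that simulates a crash through markDead() cannot leave the server in the online set.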

I have the following fix ready (and tested), which fixes the test, but I don't consider it
a long-term fix.
{code}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
index 046612a..1a2d53b 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
@@ -1760,6 +1760,7 @@ public class AssignmentManager implements ServerListener {
   public List<ServerName> getExcludedServersForSystemTable() {
     List<Pair<ServerName, String>> serverList = master.getServerManager().getOnlineServersList()
         .stream()
+        .filter((s)->!master.getServerManager().isServerDead(s))
         .map((s)->new Pair<>(s, master.getRegionServerVersion(s)))
         .collect(Collectors.toList());
     if (serverList.isEmpty()) {
{code}

[~stack], as you suggested, we can disable the test for now. Once we agree on a fix, we
can re-enable it. Let me know your thoughts. Thanks!

> Fix flaky test hbase.master.procedure.TestServerCrashProcedure#testRecoveryAndDoubleExecutionOnRsWithMeta
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-18366
>                 URL: https://issues.apache.org/jira/browse/HBASE-18366
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Umesh Agashe
>            Assignee: Umesh Agashe
>
> It worked for a few days after enabling it with HBASE-18278. But started failing after commits:
> 6786b2b
> 68436c9
> 75d2eca
> 50bb045
> df93c13
> It works with one commit before: c5abb6c. Need to see what changed with those commits.
> Currently it fails with TableNotFoundException.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
