hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Appy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-19335) Fix waitUntilAllRegionsAssigned
Date Tue, 28 Nov 2017 00:01:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267825#comment-16267825

Appy commented on HBASE-19335:

We can't clear the list.
Chia-Ping had the same question on RB. Here's a copy for quick ref:
Should we remove the rs from dead list when test try to restart the rs? IIRC, the IT has the
action which would kill-and-start RS.

That's a nice question! Keep them coming.
I considered the case and there is one reason why we can't do that, and one more reason why
it's not needed.
- We track dead RS to ignore it's presence in meta table. We can't remove a dead RS from list
because even if the RS hase been restarted, it doesn't mean the regions that were on it have
been reassigned. Maybe they are still undergoing log splitting.

- We don't need to "clean up" the list if a RS has been restarted because ServerName (domain
+ port + startcode) will be different for the new instance i.e. a dead ServerName will always
remain dead. Another thing could be memory used by list, but i suspect it'll ever grow too
big to be any issue, and it's tests, so it's fine.

Let me mention some of this as comment in the code so others don't trip on same thing.

> Fix waitUntilAllRegionsAssigned
> -------------------------------
>                 Key: HBASE-19335
>                 URL: https://issues.apache.org/jira/browse/HBASE-19335
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Appy
>            Assignee: Appy
>         Attachments: HBASE-19335.master.001.patch, HBASE-19335.master.002.patch
> Found when debugging flaky test TestRegionObserverInterface#testRecovery.
> In the end, the test does the following:
> - Kills the RS
> - Waits for all regions to be assigned
> - Some validation (unrelated)
> - Cleanup: delete table.
> {noformat}
>       cluster.killRegionServer(rs1.getRegionServer().getServerName());
>       Threads.sleep(1000); // Let the kill soak in.
>       util.waitUntilAllRegionsAssigned(tableName);
>       LOG.info("All regions assigned");
>       verifyMethodResult(SimpleRegionObserver.class,
>         new String[] { "getCtPreReplayWALs", "getCtPostReplayWALs", "getCtPreWALRestore",
>             "getCtPostWALRestore", "getCtPrePut", "getCtPostPut" },
>         tableName, new Integer[] { 1, 1, 2, 2, 0, 0 });
>     } finally {
>       util.deleteTable(tableName);
>       table.close();
>     }
>   }
> {noformat}
> However, looking at test logs, found that we had overlapping Assigns with Unassigns.
As a result, regions ended up 'stuck in RIT' and the test timeout.
> Assigns were from the ServerCrashRecovery and Unassigns were from the deleteTable cleanup.
> Which begs the question, why did HBTU.waitUntilAllRegionsAssigned(tableName) not wait
until recovery was complete.
> Answer: Looks like that function is only meant for sunny scenarios but not for crashes.
It iterates over meta and just [checks for *some value* in the server column|https://github.com/apache/hbase/blob/cdc2bb17ff38dcbd273cf501aea565006e995a06/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java#L3421]
which is obviously present and equal to the server that was just killed.
> This bug must be affecting other fault tolerance tests too and fixing it may fix more
than just one test, hopefully.

This message was sent by Atlassian JIRA

View raw message