hbase-issues mailing list archives

From "Appy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-19335) Fix waitUntilAllRegionsAssigned
Date Thu, 23 Nov 2017 02:32:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16263701#comment-16263701 ]

Appy commented on HBASE-19335:
------------------------------

I have a patch, which I'll upload shortly.

> Fix waitUntilAllRegionsAssigned
> -------------------------------
>
>                 Key: HBASE-19335
>                 URL: https://issues.apache.org/jira/browse/HBASE-19335
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Appy
>            Assignee: Appy
>
> Found while debugging the flaky test TestRegionObserverInterface#testRecovery.
> At its end, the test does the following:
> - Kills the RS
> - Waits for all regions to be assigned
> - Some validation (unrelated)
> - Cleanup: delete table.
> {noformat}
>       cluster.killRegionServer(rs1.getRegionServer().getServerName());
>       Threads.sleep(1000); // Let the kill soak in.
>       util.waitUntilAllRegionsAssigned(tableName);
>       LOG.info("All regions assigned");
>       verifyMethodResult(SimpleRegionObserver.class,
>         new String[] { "getCtPreReplayWALs", "getCtPostReplayWALs", "getCtPreWALRestore",
>             "getCtPostWALRestore", "getCtPrePut", "getCtPostPut" },
>         tableName, new Integer[] { 1, 1, 2, 2, 0, 0 });
>     } finally {
>       util.deleteTable(tableName);
>       table.close();
>     }
>   }
> {noformat}
> However, the test logs showed Assigns overlapping with Unassigns.
> As a result, regions ended up 'stuck in RIT' and the test timed out.
> Assigns were from the ServerCrashRecovery and Unassigns were from the deleteTable cleanup.
> Which raises the question: why did HBTU.waitUntilAllRegionsAssigned(tableName) not wait
> until recovery was complete?
> Answer: It looks like that function is only meant for sunny-day scenarios, not for crashes.
> It iterates over meta and just [checks for *some value* in the server column|https://github.com/apache/hbase/blob/cdc2bb17ff38dcbd273cf501aea565006e995a06/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java#L3421],
> which is obviously present and equal to the server that was just killed.
> This bug is probably affecting other fault-tolerance tests too, so fixing it may fix more
> than just one test, hopefully.
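
For illustration, the gap described above can be sketched without any HBase dependencies. This is a minimal sketch, not the patch for this issue: the class AssignmentCheckSketch, both method names, and the map/set stand-ins for the meta table and the list of live servers are all hypothetical. It only contrasts "the server column has some value" with "the server column names a server that is still alive".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in; none of these names come from HBase.
public class AssignmentCheckSketch {

  // Naive check: a region counts as assigned if its server column holds any value.
  // After a crash the column still names the dead server, so this passes too early.
  public static boolean allRegionsHaveServer(Map<String, String> regionToServer) {
    for (String server : regionToServer.values()) {
      if (server == null || server.isEmpty()) {
        return false;
      }
    }
    return true;
  }

  // Stricter check (an assumption about what a fix could verify):
  // the recorded server must also be in the set of live servers.
  public static boolean allRegionsOnLiveServers(
      Map<String, String> regionToServer, Set<String> liveServers) {
    for (String server : regionToServer.values()) {
      if (server == null || !liveServers.contains(server)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Simulated meta contents: region name -> server recorded in the server column.
    Map<String, String> meta = new LinkedHashMap<>();
    meta.put("region-a", "rs1");
    meta.put("region-b", "rs2");

    // rs1 was just killed, so only rs2 is live.
    Set<String> live = new HashSet<>(Arrays.asList("rs2"));

    System.out.println(allRegionsHaveServer(meta));          // prints true
    System.out.println(allRegionsOnLiveServers(meta, live)); // prints false
  }
}
```

With the stricter predicate, a wait loop would keep polling until region-a is reassigned to a live server, rather than returning while recovery is still in progress.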



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
