hbase-issues mailing list archives

From "Appy (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-19335) Fix waitUntilAllRegionsAssigned
Date Thu, 23 Nov 2017 02:31:00 GMT
Appy created HBASE-19335:
----------------------------

             Summary: Fix waitUntilAllRegionsAssigned
                 Key: HBASE-19335
                 URL: https://issues.apache.org/jira/browse/HBASE-19335
             Project: HBase
          Issue Type: Bug
            Reporter: Appy
            Assignee: Appy


Found while debugging the flaky test TestRegionObserverInterface#testRecovery.
Near its end, the test does the following:
- Kills the RS
- Waits for all regions to be assigned
- Some validation (unrelated)
- Cleanup: delete table.
{noformat}
      cluster.killRegionServer(rs1.getRegionServer().getServerName());
      Threads.sleep(1000); // Let the kill soak in.
      util.waitUntilAllRegionsAssigned(tableName);
      LOG.info("All regions assigned");

      verifyMethodResult(SimpleRegionObserver.class,
        new String[] { "getCtPreReplayWALs", "getCtPostReplayWALs", "getCtPreWALRestore",
            "getCtPostWALRestore", "getCtPrePut", "getCtPostPut" },
        tableName, new Integer[] { 1, 1, 2, 2, 0, 0 });
    } finally {
      util.deleteTable(tableName);
      table.close();
    }
  }
{noformat}

However, the test logs showed Assigns overlapping with Unassigns. As a
result, regions ended up 'stuck in RIT' and the test timed out.
The Assigns came from the ServerCrashRecovery and the Unassigns from the deleteTable cleanup.
Which raises the question: why did HBTU.waitUntilAllRegionsAssigned(tableName) not wait until
recovery was complete?

Answer: It looks like that function is only meant for sunny-day scenarios, not for crashes. It
iterates over meta and just [checks for *some value* in the server column|https://github.com/apache/hbase/blob/cdc2bb17ff38dcbd273cf501aea565006e995a06/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java#L3421].
After a crash that value is still present: it is the server that was just killed, so the check passes immediately.
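
The difference between "the server column holds some value" and "the server column names a live server" can be sketched with a small stand-alone model. This is not HBase code and not necessarily the fix for this issue: the class RegionAssignmentCheck, both method names, and the use of a plain Map for meta and a Set for live servers are all hypothetical simplifications chosen for illustration.

```java
import java.util.Map;
import java.util.Set;

// Simplified model, NOT the HBase API. "meta" maps a region name to the
// server name recorded in its server column; "liveServers" is the set of
// region servers that are currently running. All names are hypothetical.
class RegionAssignmentCheck {

  // Mirrors the behavior described in this issue: a region counts as
  // assigned as soon as its server column holds any non-empty value.
  static boolean naiveAllAssigned(Map<String, String> meta) {
    for (String server : meta.values()) {
      if (server == null || server.isEmpty()) {
        return false;
      }
    }
    return true;
  }

  // Stricter variant (an assumption about one possible remedy): the
  // recorded server must also be one of the live servers.
  static boolean strictAllAssigned(Map<String, String> meta, Set<String> liveServers) {
    for (String server : meta.values()) {
      if (server == null || server.isEmpty() || !liveServers.contains(server)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // region-a is still recorded on rs1, which has just been killed.
    Map<String, String> meta = Map.of("region-a", "rs1", "region-b", "rs2");
    Set<String> liveServers = Set.of("rs2");

    System.out.println("naive:  " + naiveAllAssigned(meta));               // prints "naive:  true"
    System.out.println("strict: " + strictAllAssigned(meta, liveServers)); // prints "strict: false"
  }
}
```

In this model the naive check returns true right after the kill, which matches the symptom above: the wait returns before recovery has reassigned the regions, and the cleanup's Unassigns then overlap with the recovery's Assigns.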

This bug must be affecting other fault-tolerance tests too, so fixing it will hopefully fix
more than just this one test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
