hbase-issues mailing list archives

From "ramkrishna.s.vasudevan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-7985) TestMasterFailover.testMasterFailoverWithMockedRITOnDeadRS fails frequently in 0.94
Date Mon, 04 Mar 2013 04:55:12 GMT

    [ https://issues.apache.org/jira/browse/HBASE-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591990#comment-13591990 ]

ramkrishna.s.vasudevan commented on HBASE-7985:
-----------------------------------------------

This is what is happening.
Whenever the test passes, the RS that we abort in the test is the one that carried META.
Whenever the test fails, the RS that we abort does not carry META.
As part of master initialization, I saw that the following code exists; it checks whether the
RS carrying META was down on master restart:
{code}
// log splitting for .META. server
ServerName preMetaServer = this.catalogTracker.getMetaLocationOrReadLocationFromRoot();
if (preMetaServer != null && failedServers.contains(preMetaServer)) {
  // create recovered edits file for .META. server
  this.fileSystemManager.splitLog(preMetaServer);
  failedServers.remove(preMetaServer);
}
{code}
Here we see that the RS carrying META is removed from failedServers.
In the normal case, where an RS that did not carry META was aborted, the code below adds the
server to the dead servers list in ServerManager.
{code}
for (ServerName curServer : failedServers) {
  this.serverManager.processDeadServer(curServer);
}
{code}
Now, once the RS is in the dead servers list, when the master does AM.joinCluster(), if we
find the dead server's name in any znode that was left over, we skip that znode,
expecting SSH to take care of it.
{code}
Set<ServerName> actualDeadServers = this.serverManager.getDeadServers();
for (Map.Entry<ServerName, List<Pair<HRegionInfo, Result>>> deadServer :
    deadServers.entrySet()) {
  // skip regions of dead servers because SSH will process regions during rs expiration.
  // see HBASE-5916
  if (actualDeadServers.contains(deadServer.getKey())) {
    for (Pair<HRegionInfo, Result> deadRegion : deadServer.getValue()) {
      nodes.remove(deadRegion.getFirst().getEncodedName());
    }
    continue;
  }
{code}
In the case where the RS was carrying META, the RS was not added to the dead servers list, so
we bypassed this 'continue' and the test passed.  I remember HBASE-5916 has been there for a
long time, but has the test been failing this frequently for the past 2 or 3 months?
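
To make the two cases easier to compare, here is a minimal standalone sketch of the flow described above. It is not HBase code: servers and regions are plain strings, and the class and method names (DeadServerSkipSketch, deadServersAfterInit, nodesLeftForMaster) are hypothetical names chosen only for this illustration. It only models which servers end up in the dead servers list and which znodes the master still handles itself.

```java
import java.util.*;

// Simplified model of the behaviour described in this comment; NOT actual HBase code.
// All names here are hypothetical and exist only for illustration.
class DeadServerSkipSketch {

    // Models master initialization: the server that carried META (if it failed)
    // is removed from the failed set; every remaining failed server is treated
    // as having been added to the dead servers list.
    static Set<String> deadServersAfterInit(Set<String> failedServers, String metaServer) {
        Set<String> remaining = new HashSet<>(failedServers);
        if (metaServer != null) {
            remaining.remove(metaServer);
        }
        return remaining;
    }

    // Models the skip during joinCluster: znodes of regions that belonged to a
    // server in the dead servers list are dropped and left for SSH to handle.
    static Set<String> nodesLeftForMaster(Map<String, List<String>> regionsByServer,
                                          Set<String> deadServers,
                                          Set<String> nodes) {
        Set<String> left = new TreeSet<>(nodes);
        for (Map.Entry<String, List<String>> entry : regionsByServer.entrySet()) {
            if (deadServers.contains(entry.getKey())) {
                left.removeAll(entry.getValue());
            }
        }
        return left;
    }

    public static void main(String[] args) {
        Set<String> failed = Collections.singleton("rs1");
        Map<String, List<String>> regions =
            Collections.singletonMap("rs1", Arrays.asList("regionA"));
        Set<String> nodes = Collections.singleton("regionA");

        // Case 1: the aborted RS carried META.
        Set<String> deadWhenMeta = deadServersAfterInit(failed, "rs1");
        System.out.println("aborted RS carried META: "
            + nodesLeftForMaster(regions, deadWhenMeta, nodes));

        // Case 2: META lived on another RS.
        Set<String> deadWhenNotMeta = deadServersAfterInit(failed, "rs2");
        System.out.println("aborted RS did not carry META: "
            + nodesLeftForMaster(regions, deadWhenNotMeta, nodes));
    }
}
```

In this model, when the aborted RS carried META the dead servers list stays empty and the leftover znode remains for the master to process; when it did not, the znode is removed from the set and everything depends on SSH picking it up.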
                
> TestMasterFailover.testMasterFailoverWithMockedRITOnDeadRS fails frequently in 0.94
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-7985
>                 URL: https://issues.apache.org/jira/browse/HBASE-7985
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>             Fix For: 0.94.7
>
>
> 4 failures of this test in the last 6 builds.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
