hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jeffrey Zhong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13172) TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1
Date Mon, 09 Mar 2015 05:10:39 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352559#comment-14352559
] 

Jeffrey Zhong commented on HBASE-13172:
---------------------------------------

I just skimmed through the thread. It seems the test was stucked in isServerReachable(). 

[~Apache9] In order to make the test case stable you can set config "hbase.master.maximum.ping.server.attempts"
to 3(by default it's 10). For isServerReachable() call, inside IOException catch block, we
should check following conditions and return false immediately when any of them is true.
1) if current server is put in deadServer already
2) If current IOException is one of RegionServerStoppedException or ServerNotRunningYetException

[~jxiang] The following code inside RegionStates seems unnecessary and should just return
false(because the result of isServerReachable call may still return false positive info after
retries) . In addition, should we expire the server instead directly put it in deadServers?
Thanks.

{code}
        if (serverManager.isServerReachable(server)) {
          return false;
        }
        // The size of deadServers won't grow unbounded.
        deadServers.put(hostAndPort, Long.valueOf(startCode));
{code}

> TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-13172
>                 URL: https://issues.apache.org/jira/browse/HBASE-13172
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.1.0
>            Reporter: zhangduo
>
> The direct reason is we are stuck in ServerManager.isServerReachable.
> https://builds.apache.org/job/HBase-1.1/253/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testThreeRSAbort/
> {noformat}
> 2015-03-06 04:06:19,430 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach
asf906.gq1.ygridcore.net,59366,1425614770146, try=0 of 10
> 2015-03-06 04:07:10,545 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach
asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
> {noformat}
> The interval between first and last retry log is about 1 minute, and we only wait 1 minute
so the test is timeout.
> Still do not know why this happen.
> And at last there are lots of this 
> {noformat}
> 2015-03-06 04:07:21,529 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach
asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
> org.apache.hadoop.hbase.ipc.StoppedRpcClientException
> 	at org.apache.hadoop.hbase.ipc.RpcClientImpl.getConnection(RpcClientImpl.java:1261)
> 	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1146)
> 	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
> 	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
> 	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:22031)
> 	at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1797)
> 	at org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:850)
> 	at org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:843)
> 	at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1969)
> 	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576)
> 	at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:744)
> {noformat}
> I think the problem is here
> {code:title=ServerManager.java}
>     while (retryCounter.shouldRetry()) {
>         ...
>         try {
>           retryCounter.sleepUntilNextRetry();
>         } catch(InterruptedException ie) {
>           Thread.currentThread().interrupt();
>         }
>         ...
>     }
> {code}
> We need to break out of the while loop when getting InterruptedException, not just mark
current thread as interrupted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message