hbase-issues mailing list archives

From "Jimmy Xiang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13172) TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1
Date Sun, 08 Mar 2015 23:28:39 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352339#comment-14352339 ]

Jimmy Xiang commented on HBASE-13172:
-------------------------------------

Some tests on branch-1 are flakier than on master because we may kill the RS holding meta, which
takes longer to recover.  On master there is no such issue, since meta is on the master all the
time. This also means that a flaky assignment-related test on master usually indicates a bug.
For branch-1, it is a little more complicated.

You are right that this test is not meant to test region assignment. If we can ensure that the 3 RSs
killed don't hold meta, the test may not be that flaky. We can add another test for meta
handling if there isn't such a test case already.
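
To illustrate the idea of not killing the RS that holds meta, here is a minimal sketch of the selection step only. It is not HBase code: AbortTargetPicker, pickServersToAbort, and the plain string server names are hypothetical stand-ins, and a real test would have to obtain the live servers and the meta location from the mini cluster.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of HBase. */
final class AbortTargetPicker {
  private AbortTargetPicker() {}

  /**
   * Returns the first {@code count} servers from {@code liveServers} that are
   * not {@code metaServer}, so that aborting them leaves meta untouched.
   */
  static List<String> pickServersToAbort(List<String> liveServers, String metaServer, int count) {
    List<String> picked = new ArrayList<>();
    for (String server : liveServers) {
      if (picked.size() == count) {
        break; // enough victims selected
      }
      if (!server.equals(metaServer)) {
        picked.add(server); // safe to abort: does not hold meta
      }
    }
    if (picked.size() < count) {
      // The cluster is too small to abort `count` servers without touching meta.
      throw new IllegalStateException(
          "only " + picked.size() + " servers without meta, need " + count);
    }
    return picked;
  }
}
```

The test would then abort only the returned servers, which assumes the cluster has at least one more RS than the number it wants to kill.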

> TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-13172
>                 URL: https://issues.apache.org/jira/browse/HBASE-13172
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.1.0
>            Reporter: zhangduo
>
> The direct reason is we are stuck in ServerManager.isServerReachable.
> https://builds.apache.org/job/HBase-1.1/253/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testThreeRSAbort/
> {noformat}
> 2015-03-06 04:06:19,430 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=0 of 10
> 2015-03-06 04:07:10,545 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
> {noformat}
> The interval between the first and last retry logs is about 1 minute, and we only wait 1 minute, so the test times out.
> We still do not know why this happens.
> And at the end there are lots of these:
> {noformat}
> 2015-03-06 04:07:21,529 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
> org.apache.hadoop.hbase.ipc.StoppedRpcClientException
> 	at org.apache.hadoop.hbase.ipc.RpcClientImpl.getConnection(RpcClientImpl.java:1261)
> 	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1146)
> 	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
> 	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
> 	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:22031)
> 	at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1797)
> 	at org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:850)
> 	at org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:843)
> 	at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1969)
> 	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576)
> 	at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:744)
> {noformat}
> I think the problem is here
> {code:title=ServerManager.java}
>     while (retryCounter.shouldRetry()) {
>         ...
>         try {
>           retryCounter.sleepUntilNextRetry();
>         } catch(InterruptedException ie) {
>           Thread.currentThread().interrupt();
>         }
>         ...
>     }
> {code}
> We need to break out of the while loop when we get an InterruptedException, not just mark the current thread as interrupted.
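
A minimal sketch of the fix proposed in the description, assuming a simplified loop rather than the real ServerManager code. InterruptAwareRetry, attemptsUntilDone, and the BooleanSupplier probe are hypothetical names standing in for the retry counter and the getServerInfo call; only the catch block mirrors the snippet quoted above.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for the retry loop; not the actual HBase class. */
final class InterruptAwareRetry {
  private InterruptAwareRetry() {}

  /**
   * Calls {@code probe} up to {@code maxAttempts} times, sleeping
   * {@code sleepMillis} between attempts. Returns the number of attempts made.
   * Stops early when the probe succeeds or the thread is interrupted.
   */
  static int attemptsUntilDone(BooleanSupplier probe, int maxAttempts, long sleepMillis) {
    int attempts = 0;
    while (attempts < maxAttempts) {
      attempts++;
      if (probe.getAsBoolean()) {
        break; // the server answered, no need to retry
      }
      try {
        TimeUnit.MILLISECONDS.sleep(sleepMillis);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt(); // restore the flag for callers
        break; // and stop retrying instead of continuing the loop
      }
    }
    return attempts;
  }
}
```

If the calling thread is already interrupted, attemptsUntilDone(() -> false, 10, 50) returns after a single attempt with the interrupt flag still set, instead of running through all ten tries.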



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
