Date: Mon, 9 Mar 2015 10:15:39 +0000 (UTC)
From: "zhangduo (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-13172) TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1

     [ https://issues.apache.org/jira/browse/HBASE-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhangduo updated HBASE-13172:
-----------------------------
    Assignee: zhangduo
      Status: Patch Available  (was: Open)

> TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-13172
>                 URL: https://issues.apache.org/jira/browse/HBASE-13172
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.1.0
>            Reporter: zhangduo
>            Assignee: zhangduo
>         Attachments: HBASE-13172-branch-1.patch
>
>
> The direct reason is we are stuck in ServerManager.isServerReachable.
> https://builds.apache.org/job/HBase-1.1/253/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testThreeRSAbort/
> {noformat}
> 2015-03-06 04:06:19,430 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=0 of 10
> 2015-03-06 04:07:10,545 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
> {noformat}
> The interval between the first and last retry log lines is about 1 minute, and the test only waits 1 minute, so it times out.
> I still do not know why this happens.
> In the end there are many entries like this:
> {noformat}
> 2015-03-06 04:07:21,529 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
> org.apache.hadoop.hbase.ipc.StoppedRpcClientException
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.getConnection(RpcClientImpl.java:1261)
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1146)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
>     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:22031)
>     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1797)
>     at org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:850)
>     at org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:843)
>     at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1969)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576)
>     at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
> {noformat}
> I think the problem is here:
> {code:title=ServerManager.java}
> while (retryCounter.shouldRetry()) {
>   ...
>   try {
>     retryCounter.sleepUntilNextRetry();
>   } catch(InterruptedException ie) {
>     Thread.currentThread().interrupt();
>   }
>   ...
> }
> {code}
> We need to break out of the while loop when we get an InterruptedException, not just mark the current thread as interrupted.
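
The fix proposed in the issue (leave the retry loop on InterruptedException rather than only re-setting the interrupt flag) might be sketched as follows. This is a standalone illustration, not the attached patch and not HBase code: the class InterruptAwareRetry, the Probe interface, and the isReachable signature are all hypothetical stand-ins, and Thread.sleep is used in place of retryCounter.sleepUntilNextRetry().

```java
/** Hypothetical stand-in for the retry loop discussed in the issue; not HBase code. */
class InterruptAwareRetry {

  /** Hypothetical probe; in the issue this role is played by the getServerInfo call. */
  interface Probe {
    boolean tryOnce() throws Exception;
  }

  /**
   * Calls the probe up to maxAttempts times, sleeping sleepMillis after each failed attempt.
   * Returns false as soon as the thread is interrupted instead of continuing to retry.
   */
  static boolean isReachable(Probe probe, int maxAttempts, long sleepMillis) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        if (probe.tryOnce()) {
          return true;
        }
      } catch (InterruptedException ie) {
        // Interrupted inside the probe: restore the flag and give up.
        Thread.currentThread().interrupt();
        return false;
      } catch (Exception e) {
        // Any other failure counts as "not reachable yet"; fall through to the sleep.
      }
      try {
        Thread.sleep(sleepMillis);
      } catch (InterruptedException ie) {
        // Restore the interrupt flag for callers, then stop retrying.
        Thread.currentThread().interrupt();
        break;
      }
    }
    return false;
  }
}
```

The break is the relevant part. Thread.sleep throws InterruptedException immediately when the interrupt flag is already set, so a loop that only restores the flag gains nothing from its remaining attempts. Exiting the loop lets the caller observe the interrupt and decide what to do.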