Date: Fri, 31 Oct 2014 23:34:33 +0000 (UTC)
From: "Nick Dimiduk (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Created] (HBASE-12403) IntegrationTestMTTR flaky due to aggressive RS restart timeout

Nick Dimiduk created HBASE-12403:
------------------------------------

             Summary: IntegrationTestMTTR flaky due to aggressive RS restart timeout
                 Key: HBASE-12403
                 URL: https://issues.apache.org/jira/browse/HBASE-12403
             Project: HBase
          Issue Type: Test
          Components: integration tests
            Reporter: Nick Dimiduk
            Priority: Minor

TL;DR: the ChaosMonkey (CM) RestartRS action timeout is only 60 seconds. Since a restarted RS must connect to the Master before it can come online, 60 seconds is not long enough in an environment where the Master can also be killed.

The console output says the test failed because a RestartRsHoldingMetaAction timed out.
{noformat}
Caused by: java.io.IOException: did timeout waiting for region server to start:ip-172-31-42-248.ec2.internal
	at org.apache.hadoop.hbase.HBaseCluster.waitForRegionServerToStart(HBaseCluster.java:153)
	at org.apache.hadoop.hbase.chaos.actions.Action.startRs(Action.java:93)
	at org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.restartRs(RestartActionBaseAction.java:52)
	at org.apache.hadoop.hbase.chaos.actions.RestartRsHoldingMetaAction.perform(RestartRsHoldingMetaAction.java:38)
	at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$ActionCallable.call(IntegrationTestMTTR.java:559)
	at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$ActionCallable.call(IntegrationTestMTTR.java:550)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

This failure is only reported at the end of the test run; there is no indication of when during the run it actually happened. The timeout on the start-RS operation is 60 seconds. Grepping the start/stop messages out of the logs for the window when this test ran shows that at one point the RS took 2 min 12 s between being launched and reporting for duty:

{noformat}
Fri Oct 31 14:53:17 UTC 2014 Starting regionserver on ip-172-31-42-248
2014-10-31 14:55:29,049 INFO  [regionserver60020] regionserver.HRegionServer: Serving as ip-172-31-42-248.ec2.internal,60020,1414767238992, RpcServer on ip-172-31-42-248.ec2.internal/172.31.42.248:60020, sessionid=0x249661c2b7b0118
{noformat}

The RS itself came up without incident: it spent 1 min 4 s of that time waiting for the Master to start, then attempted to report for duty from 14:54:28 to 14:55:24.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
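For illustration, the 60-second wait that times out above boils down to a poll-until-online loop whose deadline is too tight when a Master restart is in flight. The following is a simplified, hypothetical sketch of such a loop with a configurable timeout; the class, interface, and default value here are illustrative stand-ins, not the actual HBaseCluster API.

```java
// Hypothetical sketch of a waitForRegionServerToStart-style loop.
// ClusterStatus is a stand-in for the real cluster-status lookup;
// the raised default timeout is an assumption, not the shipped value.
public final class WaitForRegionServer {

  // Illustrative default, raised well past 60s so a concurrently
  // restarted Master has time to come up before the RS can report in.
  static final long DEFAULT_TIMEOUT_MS = 5 * 60 * 1000L;

  interface ClusterStatus {
    boolean isRegionServerOnline(String hostname);
  }

  static void waitForRegionServerToStart(ClusterStatus status, String hostname,
      long timeoutMs) throws java.io.IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (status.isRegionServerOnline(hostname)) {
        return;                 // RS reported for duty in time
      }
      Thread.sleep(1000);       // poll once per second
    }
    // Mirrors the message seen in the stack trace above.
    throw new java.io.IOException(
        "did timeout waiting for region server to start:" + hostname);
  }
}
```

With this shape, the flakiness described in the issue is just the deadline expiring while the RS is still blocked waiting on the Master; lengthening the timeout (or making it configurable per action) is the obvious knob.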