hbase-issues mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-12403) IntegrationTestMTTR flaky due to aggressive RS restart timeout
Date Sat, 01 Nov 2014 00:05:34 GMT

    [ https://issues.apache.org/jira/browse/HBASE-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192790#comment-14192790 ]

Enis Soztutar commented on HBASE-12403:


> IntegrationTestMTTR flaky due to aggressive RS restart timeout
> --------------------------------------------------------------
>                 Key: HBASE-12403
>                 URL: https://issues.apache.org/jira/browse/HBASE-12403
>             Project: HBase
>          Issue Type: Test
>          Components: integration tests
>            Reporter: Nick Dimiduk
>            Assignee: Nick Dimiduk
>            Priority: Minor
>             Fix For: 2.0.0, 0.98.8, 0.99.2
>         Attachments: HBASE-12403.00.patch
> TL;DR: the CM RestartRS action timeout is only 60 seconds. Considering the RS must connect to the Master before it can be online, this is not long enough in an environment where the Master can also be killed.
> Failure from the console says the test failed because a RestartRsHoldingMetaAction timed out.
> {noformat}
> Caused by: java.io.IOException: did timeout waiting for region server to start:ip-172-31-42-248.ec2.internal
> at org.apache.hadoop.hbase.HBaseCluster.waitForRegionServerToStart(HBaseCluster.java:153)
> at org.apache.hadoop.hbase.chaos.actions.Action.startRs(Action.java:93)
> at org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.restartRs(RestartActionBaseAction.java:52)
> at org.apache.hadoop.hbase.chaos.actions.RestartRsHoldingMetaAction.perform(RestartRsHoldingMetaAction.java:38)
> at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$ActionCallable.call(IntegrationTestMTTR.java:559)
> at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$ActionCallable.call(IntegrationTestMTTR.java:550)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This is only reported at the end of the test run. There's no indication as to when during the test run this failure happened. The timeout on the start RS operation is 60 seconds.
> Hacking out the start/stop messages from the logs during the time window when this test ran, it appears that at one point the RS took 2min 12s between when it was launched and when it reported for duty:
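To illustrate the kind of wait that times out here, below is a minimal sketch of a polling loop whose timeout is supplied by the caller rather than fixed at 60 seconds. It is not HBase code and not the contents of the attached patch: `StartupWait`, `waitFor`, and their parameters are hypothetical names, and the exception message only mirrors the wording seen in the stack trace.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of HBase: polls a condition until it holds
// or until a caller-supplied timeout elapses.
final class StartupWait {
    private StartupWait() {}

    /**
     * Polls `started` every `pollInterval` until it returns true.
     * Returns the time waited in milliseconds, or throws IOException
     * once `timeout` has elapsed without the condition becoming true.
     */
    static long waitFor(BooleanSupplier started, Duration timeout, Duration pollInterval)
            throws IOException, InterruptedException {
        long begin = System.nanoTime();
        long deadline = begin + timeout.toNanos();
        while (!started.getAsBoolean()) {
            // Compare via subtraction so the check stays correct if nanoTime wraps.
            if (System.nanoTime() - deadline >= 0) {
                throw new IOException(
                    "did timeout waiting for region server to start after " + timeout);
            }
            Thread.sleep(pollInterval.toMillis());
        }
        return Duration.ofNanos(System.nanoTime() - begin).toMillis();
    }
}
```

With a helper shaped like this, a test that may also kill the Master could pass a longer timeout for the start operation than a test that only restarts region servers.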
> {noformat}
> Fri Oct 31 14:53:17 UTC 2014 Starting regionserver on ip-172-31-42-248
> 2014-10-31 14:55:29,049 INFO  [regionserver60020] regionserver.HRegionServer: Serving as ip-172-31-42-248.ec2.internal,60020,1414767238992, RpcServer on ip-172-31-42-248.ec2.internal/,
> {noformat}
> The RS came up without incident. It spent 1min 4s of that time waiting on the master to start, then attempted to report for duty from 14:54:28 to 14:55:24.
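The arithmetic behind these figures can be checked with a short sketch. The timestamps are the ones quoted in the description (the 49 ms on the log line are dropped); the class and method names are made up for illustration.

```java
import java.time.Duration;
import java.time.LocalTime;

// Illustrative only: recomputes the durations quoted in the issue description.
final class StartupTimeline {
    private StartupTimeline() {}

    static final LocalTime LAUNCHED = LocalTime.of(14, 53, 17);     // "Starting regionserver"
    static final LocalTime REPORT_FROM = LocalTime.of(14, 54, 28);  // first report-for-duty attempt
    static final LocalTime REPORT_UNTIL = LocalTime.of(14, 55, 24); // last report-for-duty attempt
    static final LocalTime SERVING = LocalTime.of(14, 55, 29);      // "Serving as ..." log line
    static final Duration ACTION_TIMEOUT = Duration.ofSeconds(60);  // start RS timeout from the issue

    // Launch to "Serving as": the 2min 12s cited in the description.
    static Duration totalStartup() {
        return Duration.between(LAUNCHED, SERVING);
    }

    // Time spent attempting to report for duty.
    static Duration reportingWindow() {
        return Duration.between(REPORT_FROM, REPORT_UNTIL);
    }

    // How far the observed startup exceeded the 60 second timeout.
    static Duration overTimeout() {
        return totalStartup().minus(ACTION_TIMEOUT);
    }

    public static void main(String[] args) {
        System.out.println("total startup:    " + totalStartup().getSeconds() + "s");
        System.out.println("reporting window: " + reportingWindow().getSeconds() + "s");
        System.out.println("over timeout by:  " + overTimeout().getSeconds() + "s");
    }
}
```

The total comes to 132 seconds, of which 56 seconds were spent attempting to report for duty, so this startup overran the 60 second timeout by 72 seconds.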

This message was sent by Atlassian JIRA
