Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Thu, 11 Jul 2013 03:27:48 +0000 (UTC)
From: "Elliott Clark (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.12657174.1373499263775.30617.1373513268505@arcas>
In-Reply-To: <JIRA.12657174.1373499263775@arcas>
References: <JIRA.12657174.1373499263775@arcas>
Subject: [jira] [Updated] (HBASE-8924) Master Can fail to come up after
 chaos monkey if the sleep time is too short.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HBASE-8924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elliott Clark updated HBASE-8924:
---------------------------------

    Attachment: hbase-hbase-master-a1805.halxg.cloudera.com.log.gz

The attached master log contains the failed restart.


Here's the log from the test as it tries to bring the master back up.
{code}
2013-07-10 18:02:06,423 INFO  [pool-1-thread-4] hbase.ClusterManager: Executed remote command, exit code:0 , output:
2013-07-10 18:02:06,424 INFO  [pool-1-thread-4] util.ChaosMonkey: Killed master server:a1805.halxg.cloudera.com,60000,1373500144613
2013-07-10 18:02:06,424 INFO  [pool-1-thread-4] util.ChaosMonkey: Sleeping for:0
2013-07-10 18:02:06,424 INFO  [pool-1-thread-4] util.ChaosMonkey: Starting master:a1805.halxg.cloudera.com
2013-07-10 18:02:06,424 INFO  [pool-1-thread-4] hbase.HBaseCluster: Starting Master on: a1805.halxg.cloudera.com
2013-07-10 18:02:06,424 INFO  [pool-1-thread-4] hbase.ClusterManager: Executing remote command: /opt/hbase/current/bin/../bin/hbase-daemon.sh  start master , hostname:a1805.halxg.cloudera.com
2013-07-10 18:02:06,425 INFO  [pool-1-thread-4] util.Shell: Executing full command [/usr/bin/ssh -o ConnectTimeout=1 -o StrictHostKeyChecking=no a1805.halxg.cloudera.com "/opt/hbase/current/bin/../bin/hbase-daemon.sh  start master"]
2013-07-10 18:02:06,426 WARN  [pool-1-thread-7] client.HConnectionManager$HConnectionImplementation: Checking master connection
com.google.protobuf.ServiceException: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1805.halxg.cloudera.com/10.20.200.105:60000
	at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1589)
	at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1630)
	at org.apache.hadoop.hbase.protobuf.generated.MasterMonitorProtos$MasterMonitorService$BlockingStub.isMasterRunning(MasterMonitorProtos.java:3021)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$MasterMonitorServiceState.isMasterRunning(HConnectionManager.java:1273)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isKeepAliveMasterConnectedAndRunning(HConnectionManager.java:1916)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveMasterMonitorService(HConnectionManager.java:1866)
	at org.apache.hadoop.hbase.client.HBaseAdmin.execute(HBaseAdmin.java:2682)
	at org.apache.hadoop.hbase.client.HBaseAdmin.getClusterStatus(HBaseAdmin.java:1945)
	at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$AdminCallable.doAction(IntegrationTestMTTR.java:470)
	at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$TimingCallable.call(IntegrationTestMTTR.java:370)
	at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$TimingCallable.call(IntegrationTestMTTR.java:353)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1805.halxg.cloudera.com/10.20.200.105:60000
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:828)
	at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1455)
	at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1347)
	at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1573)
	... 15 more
{code}
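
The log above shows ChaosMonkey reporting a sleep of 0 between killing the master and issuing the start command. As a rough illustration only, and not the change made for this issue, a test harness could enforce a floor on that sleep and then poll for a readiness condition with a timeout. The class name, method names, and parameters below are all hypothetical and are not part of ChaosMonkey or any HBase API.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not HBase code: a sketch of "never restart immediately".
class RestartDelaySketch {

    // Returns the sleep to actually use: the requested value, but never below minMs.
    static long effectiveSleepMs(long requestedMs, long minMs) {
        return Math.max(requestedMs, minMs);
    }

    // Polls the condition every pollMs until it is true or timeoutMs elapses.
    // Returns true if the condition held in time, false on timeout or interrupt.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out before the condition became true
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;
    }
}
```

With something like this, a configured sleep of 0 would become `effectiveSleepMs(0, 1000)`, and the condition passed to `waitUntil` could be whatever check the test already has for "the old master process is gone" (that check is assumed here, not shown).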
                
> Master Can fail to come up after chaos monkey if the sleep time is too short.
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-8924
>                 URL: https://issues.apache.org/jira/browse/HBASE-8924
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>            Reporter: Elliott Clark
>            Assignee: Elliott Clark
>         Attachments: hbase-hbase-master-a1805.halxg.cloudera.com.log.gz
>
>
> On a real cluster, the master won't come up if the sleep time between killing it and restarting it is too short.
