Date: Thu, 11 Jul 2013 03:27:48 +0000 (UTC)
From: "Elliott Clark (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-8924) Master Can fail to come up after chaos monkey if the sleep time is too short.

     [ https://issues.apache.org/jira/browse/HBASE-8924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elliott Clark updated HBASE-8924:
---------------------------------
    Attachment: hbase-hbase-master-a1805.halxg.cloudera.com.log.gz

The attached master log contains the failed restart.

Here's the log from the test trying to bring the master back up.
{code}
2013-07-10 18:02:06,423 INFO  [pool-1-thread-4] hbase.ClusterManager: Executed remote command, exit code:0 , output:
2013-07-10 18:02:06,424 INFO  [pool-1-thread-4] util.ChaosMonkey: Killed master server:a1805.halxg.cloudera.com,60000,1373500144613
2013-07-10 18:02:06,424 INFO  [pool-1-thread-4] util.ChaosMonkey: Sleeping for:0
2013-07-10 18:02:06,424 INFO  [pool-1-thread-4] util.ChaosMonkey: Starting master:a1805.halxg.cloudera.com
2013-07-10 18:02:06,424 INFO  [pool-1-thread-4] hbase.HBaseCluster: Starting Master on: a1805.halxg.cloudera.com
2013-07-10 18:02:06,424 INFO  [pool-1-thread-4] hbase.ClusterManager: Executing remote command: /opt/hbase/current/bin/../bin/hbase-daemon.sh start master , hostname:a1805.halxg.cloudera.com
2013-07-10 18:02:06,425 INFO  [pool-1-thread-4] util.Shell: Executing full command [/usr/bin/ssh -o ConnectTimeout=1 -o StrictHostKeyChecking=no a1805.halxg.cloudera.com "/opt/hbase/current/bin/../bin/hbase-daemon.sh start master"]
2013-07-10 18:02:06,426 WARN  [pool-1-thread-7] client.HConnectionManager$HConnectionImplementation: Checking master connection
com.google.protobuf.ServiceException: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1805.halxg.cloudera.com/10.20.200.105:60000
	at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1589)
	at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1630)
	at org.apache.hadoop.hbase.protobuf.generated.MasterMonitorProtos$MasterMonitorService$BlockingStub.isMasterRunning(MasterMonitorProtos.java:3021)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$MasterMonitorServiceState.isMasterRunning(HConnectionManager.java:1273)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isKeepAliveMasterConnectedAndRunning(HConnectionManager.java:1916)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveMasterMonitorService(HConnectionManager.java:1866)
	at org.apache.hadoop.hbase.client.HBaseAdmin.execute(HBaseAdmin.java:2682)
	at org.apache.hadoop.hbase.client.HBaseAdmin.getClusterStatus(HBaseAdmin.java:1945)
	at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$AdminCallable.doAction(IntegrationTestMTTR.java:470)
	at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$TimingCallable.call(IntegrationTestMTTR.java:370)
	at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$TimingCallable.call(IntegrationTestMTTR.java:353)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1805.halxg.cloudera.com/10.20.200.105:60000
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:828)
	at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1455)
	at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1347)
	at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1573)
	... 15 more
{code}

> Master Can fail to come up after chaos monkey if the sleep time is too short.
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-8924
>                 URL: https://issues.apache.org/jira/browse/HBASE-8924
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>            Reporter: Elliott Clark
>            Assignee: Elliott Clark
>         Attachments: hbase-hbase-master-a1805.halxg.cloudera.com.log.gz
>
>
> On a real cluster the master won't come up if the sleep time between killing and starting is too short.
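
For illustration only: the log above shows "Sleeping for:0" between the kill and the start. One way a test harness could guard against that is to enforce a floor on the pause. The sketch below is not the HBASE-8924 patch and does not use ChaosMonkey's API; the class name RestartDelaySketch, both method names, and the idea of a minimum pause are all assumptions made for this example, and the right floor value for a real cluster is not stated in the issue.

```java
/**
 * Hypothetical helper (not part of HBase): enforces a minimum pause
 * between killing a process and starting it again.
 */
final class RestartDelaySketch {

    private RestartDelaySketch() {
    }

    /** Returns the pause to use: the configured value, raised to the floor if needed, never negative. */
    static long effectiveSleepMs(long configuredMs, long minimumMs) {
        return Math.max(0L, Math.max(configuredMs, minimumMs));
    }

    /**
     * Runs kill, waits for the effective pause, then runs start.
     * The two Runnables stand in for whatever the harness does to
     * stop and start the master; they are placeholders, not HBase calls.
     */
    static void restart(Runnable kill, Runnable start, long configuredMs, long minimumMs) {
        kill.run();
        try {
            Thread.sleep(effectiveSleepMs(configuredMs, minimumMs));
        } catch (InterruptedException e) {
            // Restore the interrupt flag and skip the start so the caller can decide what to do.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to restart", e);
        }
        start.run();
    }
}
```

With a configured sleep of 0 and a floor of, say, 5000 ms, effectiveSleepMs returns 5000, so the start command is never issued in the same millisecond as the kill. A fixed floor is the simplest option; polling until the old process has fully exited would be an alternative, but the issue does not say which approach was taken.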