Date: Fri, 11 Oct 2013 02:27:42 +0000 (UTC)
From: "Hudson (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-9743) RollingBatchRestartRsAction aborts if timeout

[ https://issues.apache.org/jira/browse/HBASE-9743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792266#comment-13792266 ]

Hudson commented on HBASE-9743:
-------------------------------

FAILURE: Integrated in hbase-0.96 #133 (See [https://builds.apache.org/job/hbase-0.96/133/])
HBASE-9743 RollingBatchRestartRsAction aborts if timeout (stack: rev 1531151)
* /hbase/branches/0.96/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RollingBatchRestartRsAction.java

> RollingBatchRestartRsAction aborts if timeout
> ---------------------------------------------
>
> Key: HBASE-9743
> URL: https://issues.apache.org/jira/browse/HBASE-9743
> Project: HBase
> Issue Type: Bug
> Components: test
> Reporter: stack
> Assignee: stack
> Fix For: 0.98.0, 0.96.0
>
> Attachments: 9743.txt, 9743v2.txt
>
>
> In our test rigs, we see the following quite frequently:
> {code}
> 2013-10-10 05:04:09,367 INFO [Thread-6] actions.Action: Killing region server:a1809.halxg.cloudera.com,60020,1381404629253
> 2013-10-10 05:04:09,367 INFO [Thread-6] hbase.HBaseCluster: Aborting RS: a1809.halxg.cloudera.com,60020,1381404629253
> 2013-10-10 05:04:09,367 INFO [Thread-6] hbase.ClusterManager: Executing remote command: ps aux | grep proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' ' -f2 | xargs kill -s SIGKILL , hostname:a1809.halxg.cloudera.com
> 2013-10-10 05:04:09,367 INFO [Thread-6] util.Shell: Executing full command [/usr/bin/ssh -o ConnectTimeout=1 -o StrictHostKeyChecking=no a1809.halxg.cloudera.com "ps aux | grep proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' ' -f2 | xargs kill -s SIGKILL"]
> 2013-10-10 05:04:09,621 DEBUG [Thread-5] client.HBaseAdmin: Getting current status of snapshot from master...
> 2013-10-10 05:04:09,623 DEBUG [Thread-5] client.HBaseAdmin: (#6) Sleeping: 1714ms while waiting for snapshot completion.
> 2013-10-10 05:04:10,381 WARN [Thread-6] policies.Policy: Exception occured during performing action: org.apache.hadoop.util.Shell$ExitCodeException: Connection timed out during banner exchange
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:458)
>     at org.apache.hadoop.util.Shell.run(Shell.java:373)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578)
>     at org.apache.hadoop.hbase.HBaseClusterManager$RemoteShell.execute(HBaseClusterManager.java:111)
>     at org.apache.hadoop.hbase.HBaseClusterManager.exec(HBaseClusterManager.java:187)
>     at org.apache.hadoop.hbase.HBaseClusterManager.signal(HBaseClusterManager.java:216)
>     at org.apache.hadoop.hbase.ClusterManager.kill(ClusterManager.java:97)
>     at org.apache.hadoop.hbase.DistributedHBaseCluster.killRegionServer(DistributedHBaseCluster.java:110)
>     at org.apache.hadoop.hbase.chaos.actions.Action.killRs(Action.java:84)
>     at org.apache.hadoop.hbase.chaos.actions.RollingBatchRestartRsAction.perform(RollingBatchRestartRsAction.java:60)
>     at org.apache.hadoop.hbase.chaos.policies.PeriodicRandomActionPolicy.runOneIteration(PeriodicRandomActionPolicy.java:59)
>     at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
>     at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
>     at java.lang.Thread.run(Thread.java:724)
> ...
> {code}
> So, we went to kill an RS and we timed out. The server was busy at the time. We usually see the kill go through.
> When the above happens in a RollingBatchRestartRsAction, we'll usually 'lose' a server for the rest of the test. That is at a minimum. We've also seen cases where we kill nearly all servers in the cluster, then the above timeout happens, and we are left with a test limping along, running really slowly and eventually failing.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
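
The failure mode described in the report (an exception from the kill step aborts the whole action, so the killed servers are never started again) can be illustrated with a minimal sketch. This is not the committed patch (rev 1531151) and does not use HBase classes: ClusterOps, RollingRestartSketch and rollingRestart are hypothetical names, and a plain IOException stands in for the Shell$ExitCodeException seen in the log. It assumes, as the report suggests, that a timed-out kill has usually gone through.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

/** Hypothetical stand-in for the cluster calls a rolling restart needs; not an HBase interface. */
interface ClusterOps {
  void kill(String server) throws IOException;
  void start(String server) throws IOException;
}

final class RollingRestartSketch {

  /**
   * Kills every server, then starts every server that was queued for restart.
   * A failed kill is logged and the server is still queued, on the assumption
   * (taken from the report) that the kill usually went through.
   *
   * @return the servers whose start call completed without an exception
   */
  static List<String> rollingRestart(List<String> servers, ClusterOps ops) {
    Queue<String> toStart = new ArrayDeque<>();
    for (String server : servers) {
      try {
        ops.kill(server);
      } catch (IOException e) {
        // Do not abort the whole action: presume the kill succeeded and carry on.
        System.err.println("Problem killing " + server + ", presuming dead: " + e.getMessage());
      }
      toStart.add(server);
    }

    List<String> started = new ArrayList<>();
    for (String server : toStart) {
      try {
        ops.start(server);
        started.add(server);
      } catch (IOException e) {
        // A failed start is logged so the remaining servers still get their turn.
        System.err.println("Problem starting " + server + ": " + e.getMessage());
      }
    }
    return started;
  }

  public static void main(String[] args) {
    // Simulated cluster in which the kill of "rs2" times out, as in the log above.
    ClusterOps flaky = new ClusterOps() {
      @Override public void kill(String server) throws IOException {
        if (server.equals("rs2")) {
          throw new IOException("Connection timed out during banner exchange");
        }
      }
      @Override public void start(String server) { }
    };
    System.out.println(rollingRestart(Arrays.asList("rs1", "rs2", "rs3"), flaky));
  }
}
```

The trade-off in this sketch is that a server whose kill really did fail is still sent a start command. Whether that is harmless depends on the start script, which is not covered here.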