hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9750) Add retries around Action server stop/start
Date Sat, 12 Oct 2013 20:49:42 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13793466#comment-13793466 ]

stack commented on HBASE-9750:
------------------------------

Thanks Enis.  Here is what I had so far. 

Adds a 'fix' for an issue I saw yesterday where we failed to start: the machine seemed to be
loaded, so the head on the .out ran against a .out that had not yet been created (Elliott's suggestion).

{code}
diff --git a/bin/hbase-daemon.sh b/bin/hbase-daemon.sh
index caa74d8..474e267 100755
--- a/bin/hbase-daemon.sh
+++ b/bin/hbase-daemon.sh
@@ -176,7 +176,8 @@ case $startStop in
     hbase_rotate_log $loggc
     echo starting $command, logging to $logout
     nohup $thiscmd --config "${HBASE_CONF_DIR}" internal_start $command $args < /dev/null > ${logout} 2>&1  &
-    sleep 1; head "${logout}"
+    # Don't fail if we can't get the first line -- keep going w/ startup.
+    sleep 1; head "${logout}" || true
   ;;
{code}

Here are some fixes for logging, and I'd started to add dumb retries around the start
of a regionserver so we stop losing them.

{code}
diff --git a/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java b/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
index 6900291..962eb3f 100644
--- a/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
+++ b/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
@@ -35,6 +35,7 @@ import org.apache.hadoop.hbase.ServerName;
 import org.apache.hadoop.hbase.chaos.monkies.PolicyBasedChaosMonkey;
 import org.apache.hadoop.hbase.client.HBaseAdmin;
 import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.hadoop.hbase.util.Threads;

 /**
  * A (possibly mischievous) action that the ChaosMonkey can perform.
@@ -89,8 +90,19 @@ public class Action {

   protected void startRs(ServerName server) throws IOException {
     LOG.info("Starting region server:" + server.getHostname());
-    cluster.startRegionServer(server.getHostname());
-    cluster.waitForRegionServerToStart(server.getHostname(), PolicyBasedChaosMonkey.TIMEOUT);
+    // Retry up to 3 times. This is hardcoded for now.  We don't want to retry forever.
+    final int retries = 3;
+    for (int i = 0; i < retries; i++) {
+      try {
+        cluster.startRegionServer(server.getHostname());
+        cluster.waitForRegionServerToStart(server.getHostname(), PolicyBasedChaosMonkey.TIMEOUT);
+        break; // Started; no need to retry.
+      } catch (org.apache.hadoop.util.Shell.ExitCodeException e) {
+        // The start may fail but better to just keep going though we may lose server.
+        LOG.info("Problem starting " + server + "; code=" + e.getExitCode(), e);
+        Threads.sleep(1000);
+      }
+    }
     LOG.info("Started region server:" + server + ". Reported num of rs:"
         + cluster.getClusterStatus().getServersSize());
   }
diff --git a/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RollingBatchRestartRsAction.java b/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RollingBatchRestartRsAction.java
index a4eefe7..f17d5a1 100644
--- a/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RollingBatchRestartRsAction.java
+++ b/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RollingBatchRestartRsAction.java
@@ -68,17 +68,17 @@ public class RollingBatchRestartRsAction extends BatchRestartRsAction {
         } catch (org.apache.hadoop.util.Shell.ExitCodeException e) {
           // We've seen this in test runs where we timeout but the kill went through. HBASE-9743
           // So, add to deadServers even if exception so the start gets called.
-          LOG.info("Problem killing but presume successful; code=" + e.getExitCode(), e);
+          LOG.info("Problem killing " + server + " but presume successful; code=" +
+            e.getExitCode(), e);
         }
         deadServers.add(server);
       } else {
+        ServerName server = deadServers.remove();
         try {
-          ServerName server = deadServers.remove();
           startRs(server);
         } catch (org.apache.hadoop.util.Shell.ExitCodeException e) {
           // The start may fail but better to just keep going though we may lose server.
-          //
-          LOG.info("Problem starting, will retry; code=" + e.getExitCode(), e);
+          LOG.info("Problem starting " + server + "; code=" + e.getExitCode(), e);
         }
       }

{code}
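
For reference, the bounded-retry idea in the patch above can be sketched on its own, outside HBase. This is only an illustration under stated assumptions: the class BoundedRetry, the Attempt interface, and runWithRetries are hypothetical names that do not appear in the patch, and IOException stands in for the exception type the patch catches. The point it shows is that the loop should stop at the first successful attempt and only sleep between failed ones.

```java
import java.io.IOException;

/** Hypothetical sketch of a bounded retry; none of these names come from the HBase patch. */
public class BoundedRetry {

  /** One start attempt that may fail; stands in for "start the server and wait for it". */
  @FunctionalInterface
  public interface Attempt {
    void run() throws IOException;
  }

  /**
   * Runs the attempt up to maxAttempts times, sleeping between failed attempts.
   * Returns the number of attempts used on success, or -1 if every attempt
   * failed or the thread was interrupted while sleeping.
   */
  public static int runWithRetries(Attempt attempt, int maxAttempts, long sleepMillis) {
    for (int i = 1; i <= maxAttempts; i++) {
      try {
        attempt.run();
        return i; // Success: stop here rather than running the attempt again.
      } catch (IOException e) {
        System.err.println("Attempt " + i + " of " + maxAttempts + " failed: " + e.getMessage());
        if (i < maxAttempts) {
          try {
            Thread.sleep(sleepMillis);
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // Preserve the interrupt for the caller.
            return -1;
          }
        }
      }
    }
    return -1; // Gave up; the caller decides whether losing the server is acceptable.
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Simulated start that fails twice and then succeeds.
    int used = runWithRetries(() -> {
      if (++calls[0] < 3) {
        throw new IOException("simulated start failure");
      }
    }, 3, 10L);
    System.out.println("attempts used: " + used);
  }
}
```

Returning a value instead of throwing after the last failure mirrors the "keep going though we may lose server" intent in the patch comments; a caller that would rather fail fast could check for -1 and throw.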

Had not started on Master retries....

> Add retries around Action server stop/start
> -------------------------------------------
>
>                 Key: HBASE-9750
>                 URL: https://issues.apache.org/jira/browse/HBASE-9750
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: Enis Soztutar
>
> These can fail on occasion (my upping ConnectionTimeout is not enough).  Let's just retry
> a few times rather than fail, at least for server start.  Losing a server makes tests
> run for longer and there is also the danger we could lose all servers and the long-running
> test would then outright fail.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
