Subject [2/7] git commit: ACCUMULO-1217 Add documentation about and to recover from process failure.
Date Tue, 25 Mar 2014 00:37:28 GMT
ACCUMULO-1217 Add documentation about and to recover from process


Branch: refs/heads/master
Commit: 3e749fb2cc05a4fdae9753d97ffa99bff5aeb065
Parents: 62ce752
Author: Josh Elser <>
Authored: Mon Mar 24 17:26:08 2014 -0700
Committer: Josh Elser <>
Committed: Mon Mar 24 17:26:08 2014 -0700

 .../chapters/troubleshooting.tex                | 41 ++++++++++++++++++++
 1 file changed, 41 insertions(+)
diff --git a/docs/src/main/latex/accumulo_user_manual/chapters/troubleshooting.tex b/docs/src/main/latex/accumulo_user_manual/chapters/troubleshooting.tex
index 18d472f..3e7572d 100644
--- a/docs/src/main/latex/accumulo_user_manual/chapters/troubleshooting.tex
+++ b/docs/src/main/latex/accumulo_user_manual/chapters/troubleshooting.tex
@@ -518,6 +518,47 @@ Besides these columns, you may see:
+\section{Simple System Recovery}
+Q. One of my Accumulo processes died. How do I bring it back?
+The easiest way to bring all services online for an Accumulo instance is to run the ````
+  $ bin/
+This process will check the process listing, using ``jps`` on each host before attempting to restart a service on the given host.
+Typically, this check is sufficient except in the face of a hung/zombie process. For large clusters, it may be
+undesirable to ssh to every node in the cluster to ensure that all hosts are running the appropriate processes and ```` may be of use.
+  $ ssh host_with_dead_process
+  $ bin/
+```` should be invoked on the host which is missing a given process. Like, it will start all
+necessary processes that are not currently running, but only on the current host and not cluster-wide. Tools such as ``pssh`` or
+``pdsh`` can be used to automate this process.
+```` can also be used to start a process on a given host; however, it is not generally recommended for
+users to issue this directly as the ```` and ```` scripts provide the same functionality with
+more automation and are less prone to user error.
+A. Use ```` or ````.
+Q. My process died again. Should I restart it via ``cron`` or tools like ``supervisord``?
+A. A repeatedly dying Accumulo process is a sign of a larger problem. Typically these problems are due to a
+misconfiguration of Accumulo or over-saturation of resources. Blind automation of any service restart inside of Accumulo
+is generally an undesirable situation as it is indicative of a problem that is being masked and ignored. Accumulo
+processes should be stable on the order of months and not require frequent restart.
 \section{Advanced System Recovery}
 Q. I had disasterous HDFS failure.  After bringing everything back up, several tablets refuse to go online.

