Mailing-List: contact commits-help@kudu.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@kudu.apache.org
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: alexey@apache.org
To: commits@kudu.apache.org
Message-Id: <23f04482909446b4ac5b8367f18ef6ee@git.apache.org>
Subject: kudu git commit: [docs] Add admin workflow for recovering from disk
 failure
Date: Wed, 19 Apr 2017 17:54:51 +0000 (UTC)
archived-at: Wed, 19 Apr 2017 17:54:53 -0000

Repository: kudu
Updated Branches:
  refs/heads/branch-1.3.x adb314d94 -> 3211d9781


[docs] Add admin workflow for recovering from disk failure

I didn't document how to rebalance tablets onto the repaired tserver if
necessary, since the process is complicated and error prone, and we hope
to have a rebalancing tool in the future. These docs will quickly become
outdated when KUDU-616 is fixed, but I think it's worth it to document
since we frequently receive questions on the topic.

Change-Id: I6541bffc5e9546c523df610fd8c025dd05e403bf
Reviewed-on: http://gerrit.cloudera.org:8080/6606
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <adar@cloudera.com>
Reviewed-by: Andrew Wong <awong@cloudera.com>
(cherry picked from commit 87154f4a39c77ab92d80f3effa58de3000921127)
Reviewed-on: http://gerrit.cloudera.org:8080/6677
Reviewed-by: Hao Hao <hao.hao@cloudera.com>
Reviewed-by: Jean-Daniel Cryans <jdcryans@apache.org>
Tested-by: Dan Burkert <danburkert@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/3211d978
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/3211d978
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/3211d978

Branch: refs/heads/branch-1.3.x
Commit: 3211d9781c13f2ddd990c47d097125dd1086032e
Parents: adb314d
Author: Dan Burkert <danburkert@apache.org>
Authored: Mon Apr 10 17:46:36 2017 -0700
Committer: Dan Burkert <danburkert@apache.org>
Committed: Wed Apr 19 17:40:30 2017 +0000

----------------------------------------------------------------------
 docs/administration.adoc | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/3211d978/docs/administration.adoc
----------------------------------------------------------------------
diff --git a/docs/administration.adoc b/docs/administration.adoc
index 7003160..813d097 100644
--- a/docs/administration.adoc
+++ b/docs/administration.adoc
@@ -585,3 +585,38 @@ be done with the following command:
 ----
 $ kudu cluster ksck --checksum_scan --tables IntegrationTestBigLinkedList master-01.example.com,master-02.example.com,master-03.example.com
 ----
+
+[[disk_failure_recovery]]
+=== Recovering from Disk Failure
+
+// TODO(dan): revise this once KUDU-616 is fixed.
+Kudu tablet servers are not resistant to disk failure. When a disk containing a
+data directory or the write-ahead log (WAL) dies, the entire tablet server must
+be rebuilt. Kudu will automatically re-replicate tablets on other servers after
+a tablet server fails, but manual intervention is needed in order to restore the
+failed tablet server to a running state.
+
+The first step to restoring a tablet server after a disk failure is to replace
+the failed disk, or remove the failed disk from the data-directory and/or WAL
+configuration. Next, the existing data directories and WAL directory must be
+removed. For example, if the tablet server is configured with
+`--fs_wal_dir=/data/0/kudu-tserver-wal` and
+`--fs_data_dirs=/data/1/kudu-tserver,/data/2/kudu-tserver`, the following
+command removes the existing data directories and WAL directory:
+
+[source,bash]
+----
+$ rm -rf /data/0/kudu-tserver-wal /data/1/kudu-tserver /data/2/kudu-tserver
+----
+
+After the WAL and data directories are removed, the tablet server process can be
+started. When Kudu is installed using system packages, `service` is typically
+used:
+
+[source,bash]
+----
+$ sudo service kudu-tserver start
+----
+
+Once the tablet server is running again, new tablet replicas will be created on
+it as necessary.