From: alexey@apache.org
To: commits@kudu.apache.org
Subject: kudu git commit: [docs] Add admin workflow for recovering from disk failure
Date: Wed, 19 Apr 2017 17:54:51 +0000 (UTC)

Repository: kudu
Updated Branches:
  refs/heads/branch-1.3.x adb314d94 -> 3211d9781


[docs] Add admin workflow for recovering from disk failure

I didn't document how to rebalance tablets onto the repaired tserver if
necessary, since the process is complicated and error prone, and we hope to have a
rebalancing tool in the future. These docs will quickly become outdated when
KUDU-616 is fixed, but I think it's worth it to document since we frequently
receive questions on the topic.

Change-Id: I6541bffc5e9546c523df610fd8c025dd05e403bf
Reviewed-on: http://gerrit.cloudera.org:8080/6606
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo
Reviewed-by: Andrew Wong
(cherry picked from commit 87154f4a39c77ab92d80f3effa58de3000921127)
Reviewed-on: http://gerrit.cloudera.org:8080/6677
Reviewed-by: Hao Hao
Reviewed-by: Jean-Daniel Cryans
Tested-by: Dan Burkert


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/3211d978
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/3211d978
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/3211d978

Branch: refs/heads/branch-1.3.x
Commit: 3211d9781c13f2ddd990c47d097125dd1086032e
Parents: adb314d
Author: Dan Burkert
Authored: Mon Apr 10 17:46:36 2017 -0700
Committer: Dan Burkert
Committed: Wed Apr 19 17:40:30 2017 +0000

----------------------------------------------------------------------
 docs/administration.adoc | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/3211d978/docs/administration.adoc
----------------------------------------------------------------------
diff --git a/docs/administration.adoc b/docs/administration.adoc
index 7003160..813d097 100644
--- a/docs/administration.adoc
+++ b/docs/administration.adoc
@@ -585,3 +585,38 @@ be done with the following command:
 ----
 $ kudu cluster ksck --checksum_scan --tables IntegrationTestBigLinkedList master-01.example.com,master-02.example.com,master-03.example.com
 ----
+
+[[disk_failure_recovery]]
+=== Recovering from Disk Failure
+
+// TODO(dan): revise this once KUDU-616 is fixed.
+Kudu tablet servers are not resistant to disk failure. When a disk containing a
+data directory or the write-ahead log (WAL) dies, the entire tablet server must
+be rebuilt. Kudu will automatically re-replicate tablets on other servers after
+a tablet server fails, but manual intervention is needed in order to restore the
+failed tablet server to a running state.
+
+The first step to restoring a tablet server after a disk failure is to replace
+the failed disk, or remove the failed disk from the data-directory and/or WAL
+configuration. Next, the existing data directories and WAL directory must be
+removed. For example, if the tablet server is configured with
+`--fs_wal_dir=/data/0/kudu-tserver-wal` and
+`--fs_data_dirs=/data/1/kudu-tserver,/data/2/kudu-tserver`, the following
+commands will remove the existing data directories and WAL directory:
+
+[source,bash]
+----
+$ rm -rf /data/0/kudu-tserver-wal /data/1/kudu-tserver /data/2/kudu-tserver
+----
+
+After the WAL and data directories are removed, the tablet server process can be
+started. When Kudu is installed using system packages, `service` is typically
+used:
+
+[source,bash]
+----
+$ sudo service kudu-tserver start
+----
+
+Once the tablet server is running again, new tablet replicas will be created on
+it as necessary.
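
The directory-removal step in the workflow above could be scripted roughly as follows. This is a minimal sketch and not part of the commit: `list_tserver_dirs` is a hypothetical helper that extracts the paths from the `--fs_wal_dir` and `--fs_data_dirs` flag values quoted in the docs, and the loop only prints the `rm` commands rather than running them, so the list can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper (not a Kudu tool): print the directories named by the
# tablet server's --fs_wal_dir and --fs_data_dirs flags, one path per line.
list_tserver_dirs() {
  for arg in "$@"; do
    case "$arg" in
      --fs_wal_dir=*)
        printf '%s\n' "${arg#--fs_wal_dir=}"
        ;;
      --fs_data_dirs=*)
        # --fs_data_dirs holds a comma-separated list; split it into lines.
        printf '%s\n' "${arg#--fs_data_dirs=}" | tr ',' '\n'
        ;;
    esac
  done
}

# Dry run: print the rm commands for the example configuration from the docs
# instead of executing them.
list_tserver_dirs \
  --fs_wal_dir=/data/0/kudu-tserver-wal \
  --fs_data_dirs=/data/1/kudu-tserver,/data/2/kudu-tserver |
while IFS= read -r dir; do
  printf 'rm -rf -- %s\n' "$dir"
done
```

Printing the commands first is a deliberate choice here: the removal is destructive, and this sketch assumes paths without commas or newlines, so an operator should check the list against the actual tablet server configuration before running anything.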