kudu-commits mailing list archives

From granthe...@apache.org
Subject [1/4] kudu git commit: [docs] Add tip on dealing with planned TS downtime
Date Fri, 21 Sep 2018 13:45:33 GMT
Repository: kudu
Updated Branches:
  refs/heads/master 816bc6fd8 -> fd1ffd0fb

[docs] Add tip on dealing with planned TS downtime

Rendering available at

Change-Id: I55a992a00f35945187e02c55594edc6e261a72c4
Reviewed-on: http://gerrit.cloudera.org:8080/11486
Reviewed-by: Andrew Wong <awong@cloudera.com>
Reviewed-by: Grant Henke <granthenke@apache.org>
Tested-by: Will Berkeley <wdberkeley@gmail.com>

Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/3a033d82
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/3a033d82
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/3a033d82

Branch: refs/heads/master
Commit: 3a033d829cd6aab17995b68371e7e136c47cc9b8
Parents: 816bc6f
Author: Will Berkeley <wdberkeley@gmail.org>
Authored: Thu Sep 20 12:23:41 2018 -0700
Committer: Will Berkeley <wdberkeley@gmail.com>
Committed: Thu Sep 20 21:32:51 2018 +0000

 docs/administration.adoc | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/docs/administration.adoc b/docs/administration.adoc
index 74de5a0..b176f58 100644
--- a/docs/administration.adoc
+++ b/docs/administration.adoc
@@ -1120,6 +1120,43 @@ a node onto another machine.
 . Start all Kudu processes in the cluster.
+=== Minimizing cluster disruption during temporary planned downtime of a single tablet server
+If a single tablet server is brought down temporarily in a healthy cluster, all
+tablets will remain available and clients will function as normal, after
+potential short delays due to leader elections. However, if the downtime lasts
+for more than `--follower_unavailable_considered_failed_sec` (default 300)
+seconds, the tablet replicas on the down tablet server will be replaced by new
+replicas on available tablet servers. This will cause stress on the cluster
+as tablets re-replicate and, if the downtime lasts long enough, significant
+reduction in the number of replicas on the down tablet server. This may require
+the rebalancer to fix.
+To work around this, increase `--follower_unavailable_considered_failed_sec` on
+all tablet servers so the amount of time before re-replication will start is
+longer than the expected downtime of the tablet server, including the time it
+takes the tablet server to restart and bootstrap its tablet replicas. To do
+this, run the following command for each tablet server:
+$ sudo -u kudu kudu tserver set_flag <tserver_address> follower_unavailable_considered_failed_sec <num_seconds>
+where `<num_seconds>` is the number of seconds that will encompass the downtime.
+Once the downtime is finished, reset the flag to its original value.
+$ sudo -u kudu kudu tserver set_flag <tserver_address> follower_unavailable_considered_failed_sec <original_value>
+WARNING: Be sure to reset the value of `--follower_unavailable_considered_failed_sec`
+to its original value.
+NOTE: On Kudu versions prior to 1.8, the `--force` flag must be provided in the above commands.
 === Running the tablet rebalancing tool
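
The added section asks the operator to run `set_flag` once per tablet server, before and after the downtime. A minimal sketch of how that loop might be scripted follows. It is not part of this commit: the function name `set_flag_on_all`, the `DRY_RUN` variable, and the host names in the usage note are hypothetical, and only the `kudu tserver set_flag` invocation itself is taken from the doc text above.

```shell
#!/bin/sh
# Sketch only: apply one value of the flag to every tablet server listed.
# set_flag_on_all and DRY_RUN are hypothetical helpers, not Kudu tooling.
set -eu

FLAG="follower_unavailable_considered_failed_sec"

# Usage: set_flag_on_all <value> <tserver_address>...
set_flag_on_all() {
  value="$1"
  shift
  for ts in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Dry run (the default here): print the command instead of running it.
      echo "sudo -u kudu kudu tserver set_flag ${ts} ${FLAG} ${value}"
    else
      # Real run: the same command the doc change gives for a single server.
      sudo -u kudu kudu tserver set_flag "${ts}" "${FLAG}" "${value}"
    fi
  done
}
```

For example, `set_flag_on_all 1800 ts1.example.com:7050 ts2.example.com:7050` prints the two commands that would raise the timeout to 30 minutes. Setting `DRY_RUN=0` runs them. After the downtime, the same call with the original value (300 if the default was never changed) resets the flag. On Kudu versions prior to 1.8 the command would also need `--force`, per the NOTE in the diff.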
