From: ShawnWalker
To: dev@accumulo.apache.org
Reply-To: dev@accumulo.apache.org
Subject: [GitHub] accumulo pull request #121: ACCUMULO-4353: Stabilize tablet assignment durin...
Content-Type: text/plain
Date: Thu, 23 Jun 2016 20:59:44 +0000 (UTC)

GitHub user ShawnWalker opened a pull request:

https://github.com/apache/accumulo/pull/121

ACCUMULO-4353: Stabilize tablet assignment during transient failure.

Added configuration property `table.suspend.duration` (default 0s): When a tablet server dies, its tablets are not immediately placed in the `TabletState.UNASSIGNED` state; they are moved to the new `TabletState.SUSPENDED` state instead.
A suspended tablet is reassigned only if (a) `table.suspend.duration` has passed since the tablet was suspended, or (b) the tablet server that most recently hosted the tablet has come back online. In the latter case, the tablet is assigned back to its previous host.

Added configuration property `master.metadata.suspendable` (default false): The above functionality is meant to be used only on user tablets, because suspending metadata tablets can lead to a much more significant loss of availability. It is nevertheless possible to set `table.suspend.duration` on `accumulo.metadata`. To allow metadata tablets to be suspended as well, one must also set `master.metadata.suspendable` to true. I chose not to implement suspension of the root tablet.

Implementation outline:
* `master.MasterTime` maintains an approximately monotonic clock; suspension uses it to determine how much time has passed since a tablet was suspended. `MasterTime` periodically writes its time base to ZooKeeper for persistence.
* The `server.master.state.TabletState` enum now has a `TabletState.SUSPENDED` state. `TabletLocationState` and `MetaDataStateStore` were updated to read and write suspensions.
* `server.master.state.TabletStateStore` now features a `suspend(...)` method for suspending a tablet, with an implementation in `MetaDataStateStore`. `suspend(...)` acts just like `unassign(...)`, except that it writes additional metadata indicating when each tablet was suspended and which tablet server it was suspended from.
* `master.TabletServerWatcher` updated to transition tablets to and from `TabletState.SUSPENDED`.
* `master.Master` updated to avoid balancing while any tablets remain suspended.
* Minor changes to various miniaccumulo classes to facilitate testing tablet suspension. In particular, these add the ability to (a) restart a server with a configuration different from the rest, and (b) restart fewer than the full complement of tablet servers.
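
For readers skimming the description, the reassignment rule for a suspended tablet can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: `SuspensionSketch`, `Action`, and `decide(...)` are hypothetical names, times are plain milliseconds standing in for the `MasterTime` clock, and checking expiry before the returning host is an assumption made so that the default of 0s behaves like the old immediate unassignment.

```java
import java.util.Collections;
import java.util.Set;

// Illustrative sketch only: SuspensionSketch, Action, and decide(...) are
// hypothetical names, not classes or methods from this patch.
final class SuspensionSketch {

    /** Possible outcomes for a tablet currently in the SUSPENDED state. */
    enum Action { KEEP_SUSPENDED, ASSIGN_TO_PREVIOUS_HOST, UNASSIGN }

    /**
     * @param suspendedAtMillis     time the tablet was suspended, on the master's clock
     * @param nowMillis             current time on the same clock
     * @param suspendDurationMillis value of table.suspend.duration, in milliseconds
     * @param previousHost          tablet server the tablet was suspended from
     * @param liveServers           tablet servers currently online
     */
    static Action decide(long suspendedAtMillis, long nowMillis, long suspendDurationMillis,
                         String previousHost, Set<String> liveServers) {
        // (a) The suspension has expired: fall back to ordinary unassigned handling.
        // Assumption: expiry is checked first, so with the default of 0s the
        // tablet is unassigned right away.
        if (nowMillis - suspendedAtMillis >= suspendDurationMillis) {
            return Action.UNASSIGN;
        }
        // (b) The previous host is back online: return the tablet to it.
        if (liveServers.contains(previousHost)) {
            return Action.ASSIGN_TO_PREVIOUS_HOST;
        }
        // Neither condition holds yet: leave the tablet suspended.
        return Action.KEEP_SUSPENDED;
    }

    public static void main(String[] args) {
        Set<String> live = Collections.singleton("tserver2:9997");
        // Suspended 1s ago with a 5s window while the old host is still down.
        System.out.println(decide(0L, 1_000L, 5_000L, "tserver1:9997", live)); // prints KEEP_SUSPENDED
    }
}
```

The host names in `main` are made up. The point of the sketch is only that a suspended tablet leaves the `SUSPENDED` state through exactly one of the two conditions listed above.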

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ShawnWalker/accumulo ACCUMULO-4353

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/accumulo/pull/121.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #121

----
commit 8d097c5f4388be83e90325e5b6d674ad21a6121b
Author: Shawn Walker
Date: 2016-06-21T17:34:34Z

ACCUMULO-4353: Stabilize tablet assignment during transient failure.

----