Date: Fri, 24 Jun 2016 16:25:16 +0000 (UTC)
From: "Josh Elser (JIRA)"
Reply-To: jira@apache.org
To: notifications@accumulo.apache.org
Subject: [jira] [Commented] (ACCUMULO-4353) Stabilize tablet assignment during transient failure

[ https://issues.apache.org/jira/browse/ACCUMULO-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348520#comment-15348520 ]

Josh Elser commented on ACCUMULO-4353:
--------------------------------------

bq. Ahh, my mistake then.
As a new contributor to Accumulo, I still don't have a full grasp of the rules, either written or unwritten. My feeling from watching the list was that the primary modus operandi was to present a (fully implemented) solution along with a proposed problem, and then to discuss the merits of the solution.

In general at Apache, it's best to discuss a large/intrusive change to the codebase before you spend time writing code (in case there is consensus on an approach different from the one you thought best). It goes back to the community-first aspect. It's fine that you provided code right away (I don't mean to be scolding); I just don't want to see you waste a week writing code only for consensus to settle on a different direction from the one you implemented.

> Stabilize tablet assignment during transient failure
> ----------------------------------------------------
>
> Key: ACCUMULO-4353
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4353
> Project: Accumulo
> Issue Type: Improvement
> Reporter: Shawn Walker
> Assignee: Shawn Walker
> Priority: Minor
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> When a tablet server dies, Accumulo attempts to reassign the tablets it was hosting as quickly as possible to maintain availability. If multiple tablet servers die in quick succession, such as from a rolling restart of the Accumulo cluster or a network partition, this behavior can cause a storm of reassignment and rebalancing, placing significant load on the master.
> To avert such load, Accumulo should be capable of maintaining a steady tablet assignment state in the face of transient tablet server loss. Instead of reassigning tablets as quickly as possible, Accumulo should await the return of a temporarily downed tablet server (for some configurable duration) before assigning its tablets to other tablet servers.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)