Date: Fri, 28 Aug 2015 16:40:45 +0000 (UTC)
From: "stack (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-14309) Allow load balancer to operate when there is region in transition by adding force flag

    [ https://issues.apache.org/jira/browse/HBASE-14309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720203#comment-14720203 ]

stack commented on HBASE-14309:
-------------------------------

bq. The assumption of this change is that the operator has run hbck and has a reasonable idea why regions are stuck in transition before using the force flag.

There was nothing to this effect in the patches until v6. Now it has the following:

    WARNING: Have you run hbck, etc, to determine the cause for region stuck in transition
    before using the "force" flag ?
    Examples:

It should be clearer that the flag can do damage, and it should make less reference to 'hbck', 'etc.', and 'rit'. Something like:

'For experts only. Forcing a balance may do more damage than repair when assignment is confused.'

It would be good to then link to a section in the refguide on the pluses, minuses, and implications.

Agreed that it is good to expose tools to help in extreme cases. Dangerous options that may do more harm than good need proper couching with a warning, including a justification, given when you open the issue, for why we need the option. The original, la-de-dah text is:

"This issue adds boolean parameter, force, to 'balancer' command so that admin can force region balancing even when there is region in transition - assuming RIT being transient."

which comes across as though you have nothing better to do all day than add options to commands when, in fact, you have good cause.

bq. In patch v6, added guard against system table region in transition even if force parameter carries value of true.

What is this about? If the operator thinks a force is needed, why do we internally pass on it because it is a system table in transition? It seems arbitrary. You provide no reasoning for the exclusion.

> Allow load balancer to operate when there is region in transition by adding force flag
> --------------------------------------------------------------------------------------
>
>                 Key: HBASE-14309
>                 URL: https://issues.apache.org/jira/browse/HBASE-14309
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>             Fix For: 2.0.0, 1.3.0
>
>         Attachments: 14309-branch-1.1.txt, 14309-v1.txt, 14309-v2.txt, 14309-v3.txt, 14309-v4.txt, 14309-v5-branch-1.txt, 14309-v5.txt, 14309-v5.txt, 14309-v6.txt
>
>
> This issue adds boolean parameter, force, to 'balancer' command so that admin can force region balancing even when there is region in transition - assuming RIT being transient.
> This enhancement was requested by some customer.
> The assumption of this change is that the operator has run hbck and has a reasonable idea why regions are stuck in transition before using the force flag.
> There was a recent event at the customer where a cluster ended up with a small number of regionservers hosting most of the regions on the cluster (one regionserver had 50% of the roughly 20,000 regions). The balancer couldn't be run due to the small number of regions that were stuck in transition. The admin ended up killing the regionservers so that reassignment would yield a more equitable distribution of the regions.
> On a different cluster, there was a single store file that had corrupt HDFS blocks (the SSDs on the cluster were known to lose data). However, since this single region (out of 10s of 1000s of regions on this cluster) was stuck in transition, the balancer couldn't run.
> While the state keeping in HBase isn't so good yet that the admin can kick off the balancer automatically in such scenarios knowing when it is safe to do so and when it is not, having this option available for the operator to use as he / she sees fit seems prudent.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)