Date: Tue, 13 Jan 2015 06:35:35 +0000 (UTC)
From: "Andrew Purtell (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-10528) DefaultBalancer selects plans to move regions onto draining nodes

    [ https://issues.apache.org/jira/browse/HBASE-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274800#comment-14274800 ]

Andrew Purtell commented on HBASE-10528:
----------------------------------------

I'm looking at the diff of the revert and I don't see anything suspicious. I can reproduce the timeouts we saw on Jenkins locally, but sometimes the test passes too, so it seems the test is unstable.

> DefaultBalancer selects plans to move regions onto draining nodes
> -----------------------------------------------------------------
>
>                 Key: HBASE-10528
>                 URL: https://issues.apache.org/jira/browse/HBASE-10528
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.5
>            Reporter: churro morales
>            Assignee: churro morales
>             Fix For: 1.0.0, 2.0.0, 1.1.0
>
>         Attachments: 10528-1.0.addendum, HBASE-10528-0.94.patch, HBASE-10528-0.98.patch, HBASE-10528.patch, HBASE-10528.v2.patch
>
>
> We have quite a large cluster, with more than 100k regions, and we needed to isolate a region that was very hot until we could push a patch. We put this region on its own regionserver and set that server to the draining state. The default balancer was still selecting regions to move to this server in its region plans.
> It just so happened that, for other tables, the default load balancer was creating plans that targeted the draining servers, even though they were not available to move regions to. Thus we were closing regions, attempting to move them to the draining server, and only then finding out it was draining.
> We had to disable the balancer to resolve this issue.
> There are some approaches we can take here.
> 1. Exclude draining servers altogether; don't even pass those into the load balancer from HMaster.
> 2. We could exclude draining servers from ceiling and floor calculations, where we could potentially skip load balancing because those draining servers won't be represented when deciding whether to balance.
> 3. Along with #2, when assigning regions, we would skip plans to assign regions to those draining servers.
> I am in favor of #1, which simply removes servers as candidates for balancing if they are in the draining state.
> But I would love to hear what everyone else thinks.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
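
For illustration only, approach #1 from the description above (dropping draining servers before the balancer ever sees them) could be sketched roughly as follows. This is not the attached patch and does not use HBase classes: the class name `DrainingServerFilter`, the method `excludeDraining`, and the use of plain strings for server and region names are all assumptions made to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of HBase: models the cluster state as
// server name -> list of region names hosted on that server.
public class DrainingServerFilter {

    // Returns a copy of clusterState that omits every draining server, so a
    // balancer given the result can never pick one as a move target.
    public static Map<String, List<String>> excludeDraining(
            Map<String, List<String>> clusterState, Set<String> drainingServers) {
        Map<String, List<String>> candidates = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : clusterState.entrySet()) {
            if (!drainingServers.contains(entry.getKey())) {
                // Copy the region list so the caller's map is left untouched.
                candidates.put(entry.getKey(), new ArrayList<>(entry.getValue()));
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        Map<String, List<String>> state = new HashMap<>();
        state.put("rs1", Arrays.asList("regionA", "regionB"));
        state.put("rs2", Arrays.asList("hotRegion")); // isolated, draining server

        Map<String, List<String>> candidates =
                excludeDraining(state, Collections.singleton("rs2"));
        System.out.println(candidates.keySet()); // prints [rs1]
    }
}
```

One consequence of filtering this early is that the regions already on a draining server also drop out of the balancer's view, so they are neither counted toward averages nor moved, which matches the goal of keeping the hot region isolated.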