Date: Mon, 7 Nov 2016 05:09:58 +0000 (UTC)
From: "Yu Li (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-17039) SimpleLoadBalancer schedules large amount of invalid region moves

[ https://issues.apache.org/jira/browse/HBASE-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Li updated HBASE-17039:
--------------------------
    Description:
After increasing one of our clusters to 1600 nodes, we observed a large number of invalid region moves (more than 30k) fired by the balance chore.
We then simulated the problem and printed out the balance plan. Many servers that held two regions of a given table (we use the by-table strategy) had sent both regions to two other servers that held zero regions.

In the SimpleLoadBalancer's balanceCluster function, the code block that determines the underLoadedServers might have a problem:
{code}
if (load >= min && load > 0) {
  continue; // look for other servers which haven't reached min
}
int regionsToPut = min - load;
if (regionsToPut == 0) {
  regionsToPut = 1;
}
{code}
If min is zero, a server whose load is also zero (and therefore equal to min) is not skipped and is marked as underloaded, which causes the behavior described above.
Since we increased the cluster's size to 1600+ nodes, many tables that have only 1000 regions, fewer than the number of servers, now hit this issue.
After fixing the check, the balance plan went back to normal.

  was:
After increasing one of our clusters to 1600 nodes, we observed a large number of invalid region moves (more than 30k) fired by the balance chore.
We then simulated the problem and printed out the balance plan. Many servers that held two regions of a given table (we use the by-table strategy) had sent both regions to two other servers that held zero regions.

In the SimpleLoadBalancer's balanceCluster function, the code block that determines the underLoadedServers might have a problem:

if (load >= min && load > 0) {
  continue; // look for other servers which haven't reached min
}
int regionsToPut = min - load;
if (regionsToPut == 0) {
  regionsToPut = 1;
}

If min is zero, a server whose load is also zero (and therefore equal to min) is not skipped and is marked as underloaded, which causes the behavior described above.
Since we increased the cluster's size to 1600+ nodes, many tables that have only 1000 regions, fewer than the number of servers, now hit this issue.
After fixing the check, the balance plan went back to normal.
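
The check quoted in the description can be exercised in isolation. The following is a minimal Java sketch, not the HBase source. The class and method names are hypothetical. Computing min as the integer quotient of regions over servers is an assumption about how the balancer derives it. The "guarded" variant is one possible fix for illustration, not necessarily the committed patch.

```java
// Hypothetical, simplified model of the check quoted in the description.
// Only load, min and regionsToPut come from the quoted snippet; all other names are illustrative.
final class UnderloadCheckSketch {

    /** Regions the quoted check would try to place on a server; 0 means the server is skipped. */
    static int regionsToPutAsQuoted(int load, int min) {
        if (load >= min && load > 0) {
            return 0; // stands in for "continue": server has reached min
        }
        int regionsToPut = min - load;
        if (regionsToPut == 0) {
            // For non-negative inputs this is reached only when load == 0 and min == 0.
            regionsToPut = 1;
        }
        return regionsToPut;
    }

    /** One possible guard (an assumption, not the actual patch): skip any server at or above min. */
    static int regionsToPutGuarded(int load, int min) {
        if (load >= min) {
            return 0;
        }
        return min - load;
    }

    public static void main(String[] args) {
        int regions = 1000;
        int servers = 1600;
        int min = regions / servers; // assumed derivation; integer division gives 0 here

        System.out.println("min = " + min);
        System.out.println("as quoted, an empty server asks for "
                + regionsToPutAsQuoted(0, min) + " region(s)");
        System.out.println("guarded, an empty server asks for "
                + regionsToPutGuarded(0, min) + " region(s)");
    }
}
```

With 1000 regions and 1600 servers, the quoted check asks for one region on every empty server even though min is zero, while the guarded variant asks for none.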
> SimpleLoadBalancer schedules large amount of invalid region moves
> -----------------------------------------------------------------
>
>                 Key: HBASE-17039
>                 URL: https://issues.apache.org/jira/browse/HBASE-17039
>             Project: HBase
>          Issue Type: Bug
>          Components: Balancer
>    Affects Versions: 2.0.0, 1.1.6, 1.2.3
>            Reporter: Charlie Qiangeng Xu
>            Assignee: Charlie Qiangeng Xu
>             Fix For: 2.0.0, 1.1.6, 1.2.3
>
>
> After increasing one of our clusters to 1600 nodes, we observed a large number of invalid region moves (more than 30k) fired by the balance chore.
> We then simulated the problem and printed out the balance plan. Many servers that held two regions of a given table (we use the by-table strategy) had sent both regions to two other servers that held zero regions.
> In the SimpleLoadBalancer's balanceCluster function,
> the code block that determines the underLoadedServers might have a problem:
> {code}
> if (load >= min && load > 0) {
>   continue; // look for other servers which haven't reached min
> }
> int regionsToPut = min - load;
> if (regionsToPut == 0) {
>   regionsToPut = 1;
> }
> {code}
> If min is zero, a server whose load is also zero (and therefore equal to min) is not skipped and is marked as underloaded, which causes the behavior described above.
> Since we increased the cluster's size to 1600+ nodes, many tables that have only 1000 regions, fewer than the number of servers, now hit this issue.
> After fixing the check, the balance plan went back to normal.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)