Date: Sat, 4 Feb 2017 04:30:52 +0000 (UTC)
From: "Ted Yu (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-17565) StochasticLoadBalancer may incorrectly skip balancing due to skewed multiplier sum

    [ https://issues.apache.org/jira/browse/HBASE-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15852581#comment-15852581 ]

Ted Yu commented on HBASE-17565:
--------------------------------

bq. This only effect when there are a very very big multiplier......

Big multipliers do exist. We need to consider the aggregate effect.

bq. Why we need a so big default multiplier for read replica?

The background is that when any two of the primary, secondary and tertiary replicas are on the same server, we lose the benefit of read replicas when that server goes down.
See the javadoc for RegionReplicaHostCostFunction:
{code}
   * A cost function for region replicas. We give a very high cost to hosting
   * replicas of the same region in the same host. We do not prevent the case
   * though, since if numReplicas > numRegionServers, we still want to keep the
   * replica open.
{code}

> StochasticLoadBalancer may incorrectly skip balancing due to skewed multiplier sum
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-17565
>                 URL: https://issues.apache.org/jira/browse/HBASE-17565
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0
>
>         Attachments: 17565.v1.txt, 17565.v2.txt, 17565.v3.txt
>
>
> I was investigating why a 6 node cluster kept skipping balancing requests.
> Here were the region counts on the servers:
> 449, 448, 447, 449, 453, 0
> {code}
> 2017-01-26 22:04:47,145 INFO  [RpcServer.deafult.FPBQ.Fifo.handler=1,queue=0,port=16000] balancer.StochasticLoadBalancer: Skipping load balancing because balanced cluster; total cost is 127.0171157050385, sum multiplier is 111087.0 min cost which need balance is 0.05
> {code}
> The big multiplier sum caught my eye. Here is what additional debug logging showed:
> {code}
> 2017-01-27 23:25:31,749 DEBUG [RpcServer.deafult.FPBQ.Fifo.handler=9,queue=0,port=16000] balancer.StochasticLoadBalancer: class org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer$RegionReplicaHostCostFunction with multiplier 100000.0
> 2017-01-27 23:25:31,749 DEBUG [RpcServer.deafult.FPBQ.Fifo.handler=9,queue=0,port=16000] balancer.StochasticLoadBalancer: class org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer$RegionReplicaRackCostFunction with multiplier 10000.0
> {code}
> Note, however, that no table in the cluster used read replica.
> I can think of two ways of fixing this situation:
> 1. If there is no read replica in the cluster, ignore the multipliers for the above two functions.
> 2. When cost() returned by the CostFunction is 0 (or very very close to 0.0), ignore the multiplier.
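
To make the arithmetic behind the skipped balancing concrete, here is a minimal sketch using the numbers from the log lines quoted in the issue. It assumes the balancer skips when the total cost divided by the multiplier sum falls below the minimum cost threshold, and that the two replica cost functions contribute zero cost when no table uses read replicas, so dropping their multipliers leaves the total cost unchanged. The class `BalanceSkipSketch` and the method `wouldSkip` are hypothetical names for illustration, not HBase API.

```java
// Illustrative sketch only; not the actual StochasticLoadBalancer code.
class BalanceSkipSketch {

    // Hypothetical helper. Assumption: balancing is skipped when the total cost,
    // normalized by the sum of multipliers, is below the configured minimum.
    static boolean wouldSkip(double totalCost, double sumMultiplier, double minCostNeedBalance) {
        if (totalCost <= 0 || sumMultiplier <= 0) {
            return true; // nothing to weigh, treat the cluster as balanced
        }
        return totalCost / sumMultiplier < minCostNeedBalance;
    }

    public static void main(String[] args) {
        // Values copied from the INFO log line in the issue description.
        double totalCost = 127.0171157050385;
        double sumMultiplier = 111087.0;
        double minCostNeedBalance = 0.05;

        // Multipliers of RegionReplicaHostCostFunction and RegionReplicaRackCostFunction
        // from the DEBUG log lines.
        double replicaMultipliers = 100000.0 + 10000.0;

        // With the replica multipliers included, the normalized cost is about 0.0011.
        System.out.println("as logged: skip = "
            + wouldSkip(totalCost, sumMultiplier, minCostNeedBalance));

        // Ignoring them (fix option 1 or 2) leaves a sum of 1087.0 and a
        // normalized cost of about 0.117, which is above the 0.05 threshold.
        System.out.println("replica multipliers ignored: skip = "
            + wouldSkip(totalCost, sumMultiplier - replicaMultipliers, minCostNeedBalance));
    }
}
```

Under these assumptions, the first check reports that balancing is skipped and the second reports that it is not, which matches the observation that one server held 0 regions while the others held roughly 450 each.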