hbase-issues mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-17565) StochasticLoadBalancer may incorrectly skip balancing due to skewed multiplier sum
Date Fri, 03 Feb 2017 19:06:51 GMT

    [ https://issues.apache.org/jira/browse/HBASE-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851954#comment-15851954 ]

Enis Soztutar commented on HBASE-17565:
---------------------------------------

I think we should add a couple of mock clusters that resemble the cluster in the issue
description, something like {{500, 500, 500, 500, 500, 0}}, {{1500, 500, 500, 500, 10, 0}},
{{1500, 500, 500, 10, 10, 0}}, etc.

TestStochasticLoadBalancer already has very similar tests, so we can extend them to reproduce
needsBalance() misbehaving without the patch and to confirm that the test passes with
the patch.
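
To make the suggested mock clusters concrete, here is a minimal, self-contained sketch that scores each one for region-count skew. It does not use any HBase classes: the class MockClusterSkew and the method skewCost are hypothetical, and the formula is a simplified stand-in for the balancer's region-count skew cost, not the real implementation. It only illustrates that every proposed cluster is clearly unbalanced, so a test would expect needsBalance() to return true for each of them.

```java
import java.util.Arrays;

// Self-contained sketch; MockClusterSkew and skewCost are hypothetical names,
// not part of HBase.
class MockClusterSkew {
    // Region counts per server, taken from the mock clusters proposed above.
    static final int[][] MOCK_CLUSTERS = {
        {500, 500, 500, 500, 500, 0},
        {1500, 500, 500, 500, 10, 0},
        {1500, 500, 500, 10, 10, 0},
    };

    // Simplified stand-in for a region-count skew cost, scaled to [0, 1]:
    // total absolute deviation from the mean, divided by the deviation of
    // the worst case in which a single server holds every region.
    static double skewCost(int[] regionsPerServer) {
        int n = regionsPerServer.length;
        double total = Arrays.stream(regionsPerServer).sum();
        if (n < 2 || total == 0) {
            return 0.0;
        }
        double mean = total / n;
        double deviation = 0.0;
        for (int count : regionsPerServer) {
            deviation += Math.abs(count - mean);
        }
        double worst = (total - mean) + (n - 1) * mean;
        return deviation / worst;
    }

    public static void main(String[] args) {
        for (int[] cluster : MOCK_CLUSTERS) {
            System.out.printf("%s -> skew cost %.3f%n",
                Arrays.toString(cluster), skewCost(cluster));
        }
    }
}
```

Under this stand-in, all three mock clusters score well above the 0.05 threshold quoted in the log in the issue description, while an evenly loaded cluster scores 0.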

> StochasticLoadBalancer may incorrectly skip balancing due to skewed multiplier sum
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-17565
>                 URL: https://issues.apache.org/jira/browse/HBASE-17565
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0
>
>         Attachments: 17565.v1.txt, 17565.v2.txt
>
>
> I was investigating why a 6 node cluster kept skipping balancing requests.
> Here were the region counts on the servers:
> 449, 448, 447, 449, 453, 0
> {code}
> 2017-01-26 22:04:47,145 INFO  [RpcServer.deafult.FPBQ.Fifo.handler=1,queue=0,port=16000] balancer.StochasticLoadBalancer: Skipping load balancing because balanced cluster; total cost is 127.0171157050385, sum multiplier is 111087.0 min cost which need balance is 0.05
> {code}
> The big multiplier sum caught my eye. Here is what additional debug logging showed:
> {code}
> 2017-01-27 23:25:31,749 DEBUG [RpcServer.deafult.FPBQ.Fifo.handler=9,queue=0,port=16000] balancer.StochasticLoadBalancer: class org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer$RegionReplicaHostCostFunction with multiplier 100000.0
> 2017-01-27 23:25:31,749 DEBUG [RpcServer.deafult.FPBQ.Fifo.handler=9,queue=0,port=16000] balancer.StochasticLoadBalancer: class org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer$RegionReplicaRackCostFunction with multiplier 10000.0
> {code}
> Note, however, that no table in the cluster used read replicas.
> I can think of two ways of fixing this situation:
> 1. If there is no read replica in the cluster, ignore the multipliers for the above two functions.
> 2. When the cost() returned by a CostFunction is 0 (or very close to 0.0), ignore its multiplier.
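
The arithmetic behind the skipped balancing can be sketched as follows. This is not the HBase source: the class NeedsBalanceSketch, its method, and the EPSILON value are hypothetical. It assumes, as the log message suggests, that the balancer compares total cost divided by the multiplier sum against the minimum cost (0.05). It also assumes the two replica cost functions returned 0, since no table used read replicas, and it lumps all other cost functions into one entry whose multiplier is 111087 - 110000 = 1087.

```java
// Self-contained sketch; names and EPSILON are assumptions, not HBase code.
class NeedsBalanceSketch {
    static final double MIN_COST_NEED_BALANCE = 0.05; // value from the log line
    static final double EPSILON = 1e-9; // hypothetical cutoff for "very close to 0"

    // Returns true when the multiplier-weighted average cost reaches the
    // threshold. With skipZeroCost set, functions whose cost is (nearly) 0
    // are left out of both sums, which corresponds to fix option 2.
    static boolean needsBalance(double[] multipliers, double[] costs,
                                boolean skipZeroCost) {
        double total = 0.0;
        double sumMultiplier = 0.0;
        for (int i = 0; i < multipliers.length; i++) {
            if (multipliers[i] <= 0) {
                continue;
            }
            if (skipZeroCost && costs[i] < EPSILON) {
                continue;
            }
            sumMultiplier += multipliers[i];
            total += multipliers[i] * costs[i];
        }
        return total > 0 && sumMultiplier > 0
            && total / sumMultiplier >= MIN_COST_NEED_BALANCE;
    }

    public static void main(String[] args) {
        // Entry 0: all non-replica functions combined (total cost from the log).
        // Entries 1 and 2: the replica host and rack functions, assumed cost 0.
        double[] multipliers = {1087.0, 100000.0, 10000.0};
        double[] costs = {127.0171157050385 / 1087.0, 0.0, 0.0};

        // 127.017 / 111087 is about 0.0011, below 0.05, so balancing is skipped.
        System.out.println(needsBalance(multipliers, costs, false));
        // 127.017 / 1087 is about 0.117, above 0.05, so balancing would run.
        System.out.println(needsBalance(multipliers, costs, true));
    }
}
```

Under these assumptions the two zero-cost replica functions inflate the denominator by 110000 without adding anything to the total cost, which is why a cluster with an empty server is still reported as balanced.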



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
