hbase-issues mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-8119) Optimize StochasticLoadBalancer
Date Tue, 19 Mar 2013 01:41:15 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605847#comment-13605847 ]

Enis Soztutar edited comment on HBASE-8119 at 3/19/13 1:40 AM:
---------------------------------------------------------------

Quoting review at https://reviews.apache.org/r/9998/ 
Attaching a patch that improves the running time of StochasticLoadBalancer by about 200x.

TestStochasticLoadBalancer#testMidCluster() Current impl:
//2013-03-15 17:28:25,495 DEBUG [main] balancer.StochasticLoadBalancer(256): Finished computing
new laod balance plan.  Computation took 172526ms to try 15000 different iterations.  Found
a solution that moves 600 regions; Going from a computed cost of 35.85000000000001 to a new
cost of 23.481578947368426
With patch:
//2013-03-18 14:56:13,541 DEBUG [Thread-2] balancer.StochasticLoadBalancer(436): Finished
computing new laod balance plan.  Computation took 941ms to try 15000 different iterations.
 Found a solution that moves 600 regions; Going from a computed cost of 35.85 to a new cost
of 23.48157894736842

The improvements come from: 
 - Optimized array based data structures in Cluster class
 - Getting rid of hashmaps 
 - Optimized region move and swap ops 
 - Moving most of the computation to cluster initialization and to cluster state changes,
thus avoiding recomputing the same results over and over
 - Some profiling
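
To illustrate the first four points, here is a minimal sketch of array-based cluster state with aggregates maintained on each state change. It is not the code from the patch: the class name, fields, and the sum-of-squares skew measure are all assumptions made for illustration.

```java
/** Hypothetical sketch of array-backed cluster state; not the Cluster class from the patch. */
final class ArrayClusterSketch {
    // regionToServer[r] is the index of the server hosting region r.
    private final int[] regionToServer;
    // regionsPerServer[s] is the number of regions hosted on server s.
    private final int[] regionsPerServer;
    // Sum of squared per-server region counts, computed once and then updated per move.
    private long sumOfSquares;

    ArrayClusterSketch(int numServers, int[] initialAssignment) {
        this.regionToServer = initialAssignment.clone();
        this.regionsPerServer = new int[numServers];
        for (int server : regionToServer) {
            regionsPerServer[server]++;
        }
        for (int count : regionsPerServer) {
            sumOfSquares += square(count);
        }
    }

    private static long square(int value) {
        return (long) value * value;
    }

    /** Moves one region and updates the cached aggregate in O(1), with no map lookups. */
    void moveRegion(int region, int toServer) {
        int fromServer = regionToServer[region];
        if (fromServer == toServer) {
            return;
        }
        // Remove the old contributions of the two affected servers.
        sumOfSquares -= square(regionsPerServer[fromServer]) + square(regionsPerServer[toServer]);
        regionsPerServer[fromServer]--;
        regionsPerServer[toServer]++;
        // Add their new contributions back.
        sumOfSquares += square(regionsPerServer[fromServer]) + square(regionsPerServer[toServer]);
        regionToServer[region] = toServer;
    }

    /** Swaps the hosts of two regions; per-server counts are unchanged by a swap. */
    void swapRegions(int regionA, int regionB) {
        int serverA = regionToServer[regionA];
        regionToServer[regionA] = regionToServer[regionB];
        regionToServer[regionB] = serverA;
    }

    /** Skew measure read from the cached aggregate instead of rescanning all servers. */
    long skewCost() {
        return sumOfSquares;
    }

    int regionCount(int server) {
        return regionsPerServer[server];
    }

    int serverOf(int region) {
        return regionToServer[region];
    }
}
```

The point of the sketch is that evaluating a candidate move touches only the two affected servers, so each of the 15000 iterations can avoid recomputing per-server totals from scratch.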

There should be room for further optimizations, but this should be a good start. If we run into more
problems, we can investigate further. There are a lot of TODOs added in this patch. I'll
create a jira for collecting some thoughts, but I won't have the time to work on those for
now. 

There are (hopefully) minor semantic changes in the algo. I had to bump up loadMultiplier,
and decrease moveCostMultiplier. See comments at TestStochasticLoadBalancer#testLargeCluster().
Please review carefully. 

As noted in testLargeCluster(), this does not work for large clusters (> 100000 regions,
1000 nodes). This can be solved by something like http://en.wikipedia.org/wiki/Simulated_annealing
instead of a random walk with eager selection. 
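
As a rough illustration of the difference, the sketch below contrasts an eager acceptance rule with the standard simulated annealing (Metropolis) rule on a toy region-count cost. Nothing here is from the patch: the class, the method names, the sum-of-squares cost, and the geometric cooling schedule are assumptions, and move cost is ignored entirely.

```java
import java.util.Random;

/** Hypothetical sketch comparing acceptance rules; not code from StochasticLoadBalancer. */
final class AnnealingSketch {
    /** Eager selection: keep a candidate only if it is strictly cheaper. */
    static boolean acceptGreedy(double currentCost, double candidateCost) {
        return candidateCost < currentCost;
    }

    /**
     * Simulated annealing: always keep an improvement, and keep a worse candidate
     * with probability exp(-(candidate - current) / temperature).
     */
    static boolean acceptAnnealing(double currentCost, double candidateCost,
                                   double temperature, double uniformSample) {
        if (candidateCost <= currentCost) {
            return true;
        }
        return uniformSample < Math.exp(-(candidateCost - currentCost) / temperature);
    }

    /** Toy cost: sum of squared per-server region counts (an assumed skew measure). */
    static long cost(int[] regionsPerServer) {
        long total = 0;
        for (int count : regionsPerServer) {
            total += (long) count * count;
        }
        return total;
    }

    /** Random single-region moves with annealing acceptance; returns the best state seen. */
    static int[] balance(int[] regionsPerServer, int steps, double startTemperature,
                         double coolingRate, long seed) {
        Random random = new Random(seed);
        int[] current = regionsPerServer.clone();
        int[] best = current.clone();
        double temperature = startTemperature;
        for (int step = 0; step < steps; step++) {
            int from = random.nextInt(current.length);
            int to = random.nextInt(current.length);
            if (from != to && current[from] > 0) {
                long before = cost(current);
                current[from]--;
                current[to]++;
                long after = cost(current);
                if (acceptAnnealing(before, after, temperature, random.nextDouble())) {
                    if (after < cost(best)) {
                        best = current.clone();
                    }
                } else {
                    // Rejected: undo the move.
                    current[from]++;
                    current[to]--;
                }
            }
            // Geometric cooling: worse moves become less likely over time.
            temperature *= coolingRate;
        }
        return best;
    }
}
```

For example, `AnnealingSketch.balance(new int[] {33, 34, 42, 282, 34}, 15000, 100.0, 0.999, 42L)` starts from the region counts reported in the issue description. Because the best state seen is tracked separately, the result is never more costly than the starting point, while the occasional acceptance of worse moves early on is what lets the search leave local minima that eager selection would get stuck in.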
                
> Optimize StochasticLoadBalancer
> -------------------------------
>
>                 Key: HBASE-8119
>                 URL: https://issues.apache.org/jira/browse/HBASE-8119
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 0.95.0
>            Reporter: Enis Soztutar
>             Fix For: 0.95.0
>
>
> On a 5 node trunk cluster, I ran into a weird problem with StochasticLoadBalancer:
> server1 	Thu Mar 14 03:42:50 UTC 2013 	0.0 	33
> server2 	Thu Mar 14 03:47:53 UTC 2013 	0.0 	34
> server3 	Thu Mar 14 03:46:53 UTC 2013 	465.0 	42
> server4 	Thu Mar 14 03:47:53 UTC 2013 	11455.0 	282
> server5 	Thu Mar 14 03:47:53 UTC 2013 	0.0 	34
> Total:5 		11920 	425
> Notice that server4 has 282 regions, while the others have far fewer. In addition, one table
> with 260 regions is severely imbalanced:
> {code}
> Regions by Region Server
> Region Server	Region Count
> http://server3:60030/ 	10
> http://server4:60030/ 	250
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
