hbase-user mailing list archives

From bijieshan <bijies...@huawei.com>
Subject Re: One problem with LoadBalancer
Date Sat, 25 Jun 2011 02:01:05 GMT
Thanks J-D.

I have filed an issue and attached the logs:
https://issues.apache.org/jira/browse/HBASE-4031

Please check whether the logs provide all the missing information.

>> What happened to the first master?

We killed the active one and let the standby become the active one, because we were running
some tests on master switchover.

>> How come 1306205940117 went from 5841 regions to 0?

This regionserver hit some exceptions and aborted. It seems there was no master at
that time, so no ServerShutdownHandler processing happened.

Jieshan Bean.


---------------------------------------------------------------------------

I feel like I'm missing too much information to be helpful. For
example, when the standby master comes up it needs to process 13134 RIT. What
happened there? I thought the regions were all assigned? What happened
to the first master? How come 1306205940117 went from 5841 regions to
0?

Thx for filling the gaps,

J-D

On Thu, Jun 23, 2011 at 6:35 PM, bijieshan <bijieshan@huawei.com> wrote:
> Hi,
>
>  I found a problem: the cluster couldn't balance. One node's region count is roughly
> double that of the other nodes, and the balancer didn't move regions anymore:
> Address               Start Code      Load
> 158-1-101-202:20030   1306205409671   requests=0, regions=2593, usedHeap=114, maxHeap=8165
> 158-1-101-222:20030   1306205940117   requests=0, regions=5841, usedHeap=80, maxHeap=8165
> 158-1-101-52:20030    1306205417261   requests=0, regions=2622, usedHeap=76, maxHeap=8165
> 158-1-101-82:20030    1306205415714   requests=0, regions=2633, usedHeap=69, maxHeap=8165
> Total:  servers: 4   requests=0, regions=13689
>
>
>  While analyzing this problem I found HBASE-3985, "Same Region could be picked out twice
> in LoadBalancer".
>  But I'm afraid it's not the main cause of the problem.
>
>  There's one active master, one standby master, and four regionservers in our cluster.
>
>>>10:57:41, the standby HMaster 222 becomes the active one.
> 2011-05-24 10:57:41,314 INFO org.apache.hadoop.hbase.master.HMaster: Master startup proceeding: master failover
>
>>>The 4 regionservers registered with 222 one by one. Only one regionserver registered noticeably later.
> 2011-05-24 10:57:37,533 INFO : Registering server=158-1-101-82,20020,1306205415714, regionCount=3388, userLoad=true
> 2011-05-24 10:57:37,537 INFO : Registering server=158-1-101-202,20020,1306205409671, regionCount=3453, userLoad=true
> 2011-05-24 10:57:37,598 INFO : Registering server=158-1-101-52,20020,1306205417261, regionCount=3411, userLoad=true
> 2011-05-24 10:59:00,408 INFO : Registering server=158-1-101-222,20020,1306205940117, regionCount=0, userLoad=false
>
>>>13134 regions needed to move after rebuildUserRegions (13689 regions in the cluster at the time).
> 2011-05-24 10:58:47,534 INFO org.apache.hadoop.hbase.master.AssignmentManager: Failed-over master needs to process 13134 regions in transition
>
>>>All 13134 regions were opened. Number of regions opened on each server:
> 158-1-101-222,20020,1306205940117    Count: 834
> 158-1-101-82,20020,1306205415714    Count: 4093
> 158-1-101-202,20020,1306205409671    Count: 4118
> 158-1-101-52,20020,1306205417261    Count: 4089
>
>>>The most recent balancer calculation result:
> 2011-05-24 11:12:11,076 INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in 19ms. Moving 5012 regions off of 3 overloaded servers onto 1 less loaded servers
>
> "5012" is an implausible number here, since it is larger than the average number of regions per server, "3424.5".
>
>
> Jieshan Bean
>
>
>
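
As a rough sanity check on the "Moving 5012 regions" figure, the sketch below computes the fewest moves that would even out the per-server opened counts quoted above. It is only back-of-the-envelope arithmetic, not the HBase LoadBalancer algorithm: the function name `min_moves_to_balance` and the "every server ends at floor or ceiling of the average" target are assumptions made for illustration. It also uses only the 13134 opened regions, so its average is not the same figure as the one quoted in the message.

```python
# Regions opened on each server, copied from the counts quoted in the message.
opened = {
    "158-1-101-222": 834,
    "158-1-101-82": 4093,
    "158-1-101-202": 4118,
    "158-1-101-52": 4089,
}


def min_moves_to_balance(counts):
    """Fewest region moves so every server holds floor(avg) or ceil(avg) regions.

    Hypothetical helper for illustration; this is NOT how HBase's
    LoadBalancer computes its plan.
    """
    total = sum(counts.values())
    base, extra = divmod(total, len(counts))
    # Let the busiest `extra` servers keep base + 1 regions, the rest keep base.
    ordered = sorted(counts.values(), reverse=True)
    moves = 0
    for i, count in enumerate(ordered):
        target = base + 1 if i < extra else base
        moves += max(0, count - target)  # only overloaded servers shed regions
    return moves


total = sum(opened.values())
print(total)                          # 13134
print(total / len(opened))            # 3283.5
print(min_moves_to_balance(opened))   # 2449
```

Under these assumptions about 2449 moves would be enough to even out the four servers, so the logged 5012 is roughly twice that lower bound.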
