hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-3420) Handling a big rebalance, we can queue multiple instances of a Close event; messes up state
Date Wed, 05 Jan 2011 18:12:47 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977890#action_12977890 ]

stack commented on HBASE-3420:

It's the timeout of a close.  Here is the sequence:

2011-01-05 00:49:37,670 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041.,
src=sv2borg181,60020,1294096110452, dest=sv2borg188,60020,1294187735582
2011-01-05 00:49:37,670 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment
of region TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041. (offlining)
2011-01-05 00:49:37,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE
to serverName=sv2borg181,60020,1294096110452, load=(requests=0, regions=0, usedHeap=0, maxHeap=0)
for region TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041.
2011-01-05 00:49:38,310 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x12d3de9e7c60e37
Retrieved 112 byte(s) of data from znode /hbase/unassigned/b1fa38bb610943e9eadc604babe4d041
and set watcher; region=TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041.,
server=sv2borg181,60020,1294096110452, state=RS_ZK_REGION_CLOSED
2011-01-05 00:49:38,385 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new
unassigned node: /hbase/unassigned/b1fa38bb610943e9eadc604babe4d041 (region=TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041.,
server=sv2borg181,60020,1294096110452, state=RS_ZK_REGION_CLOSED)
2011-01-05 00:49:38,385 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED,
server=sv2borg181,60020,1294096110452, region=b1fa38bb610943e9eadc604babe4d041
2011-01-05 00:50:12,412 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in
transition timed out:  TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041.
state=CLOSED, ts=1294188578211
2011-01-05 00:50:12,412 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has
been CLOSED for too long, retriggering ClosedRegionHandler
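
The log above ends with the timeout monitor retriggering ClosedRegionHandler for a region that is still CLOSED. If every timeout queues another handler, duplicates pile up behind the first one. Below is a minimal Java sketch of one possible guard: remember which regions already have a close handler queued and refuse to queue a second. `PendingCloseTracker` and its methods are hypothetical names used for illustration only. They are not HBase classes, and this is not necessarily how the issue was resolved.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not part of HBase): tracks regions that already have a close handler queued. */
class PendingCloseTracker {
    // Thread-safe set keyed by encoded region name, e.g. "b1fa38bb610943e9eadc604babe4d041".
    private final Set<String> pending = ConcurrentHashMap.newKeySet();

    /** Returns true only if no close handler is currently queued for this region. */
    boolean tryQueue(String encodedRegionName) {
        // Set.add is atomic here and returns false when the name is already present.
        return pending.add(encodedRegionName);
    }

    /** Called once the queued handler has run, so a later close can be queued again. */
    void done(String encodedRegionName) {
        pending.remove(encodedRegionName);
    }

    /** Number of regions with a handler outstanding. */
    int size() {
        return pending.size();
    }
}
```

Under this assumption, the timeout path would call `tryQueue` before submitting a handler to the executor and skip the submit when it returns false; the handler would call `done` when it finishes.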

> Handling a big rebalance, we can queue multiple instances of a Close event; messes up state
> -------------------------------------------------------------------------------------------
>                 Key: HBASE-3420
>                 URL: https://issues.apache.org/jira/browse/HBASE-3420
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.0
>            Reporter: stack
>             Fix For: 0.90.1
> This is pretty ugly.  In short, on a heavily loaded cluster, we are queuing multiple
instances of region close.  They all try to run, confusing state.
> Long version:
> I have a messy cluster.  It's 16k regions on 8 servers.  One node has 5k or so regions
on it.  Heaps are 1G all around.  My master had OOME'd.  Not sure why but not too worried
about it for now.  So, new master comes up and is trying to rebalance the cluster:
> {code}
> 2011-01-05 00:48:07,385 INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated
a load balance in 14ms. Moving 3666 regions off of 6 overloaded servers onto 3 less loaded
> {code}
> The balancer ends up sending many closes to a single overloaded server, and they take so
long that the close times out in RIT (regions in transition).  We then do this:
> {code}
>               case CLOSED:
>                 LOG.info("Region has been CLOSED for too long, " +
>                     "retriggering ClosedRegionHandler");
>                 AssignmentManager.this.executorService.submit(
>                     new ClosedRegionHandler(master, AssignmentManager.this,
>                         regionState.getRegion()));
>                 break;
> {code}
> We queue a new close (Should we?).
> We time out a few more times (9 times) and each time we queue a new close.
> Eventually the close succeeds, the region gets assigned a new location.
> Then the next close pops off the eventhandler queue.
> Here is the telltale signature of stuff gone amiss:
> {code}
> 2011-01-05 00:52:19,379 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing
OFFLINE; was=TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041. state=OPEN,
> {code}
> Notice how the state is OPEN when we are forcing offline (it was actually just successfully
opened).  We end up assigning to the same server because the plan was still around:
> {code}
> 2011-01-05 00:52:20,705 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Attempted open of TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041. but
already online on this server
> {code}
> But later, when the plan is cleared, we assign a new server and we have a double assignment.
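
The "Forcing OFFLINE; was=... state=OPEN" line in the description is what a stale close handler looks like when it runs after the region has already moved on. A rough Java sketch of a defensive check follows: the handler forces the region offline only if the region is still CLOSED, and otherwise does nothing. `StaleCloseGuard`, its `State` enum, and `handleClosed` are hypothetical stand-ins for illustration; the real AssignmentManager state handling is more involved, and this is an assumption about one possible mitigation, not the committed fix.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in (not HBase code) for region state tracking on the master. */
class StaleCloseGuard {
    enum State { CLOSED, OFFLINE, OPENING, OPEN }

    private final Map<String, State> regionStates = new ConcurrentHashMap<>();

    void setState(String region, State state) {
        regionStates.put(region, state);
    }

    State getState(String region) {
        return regionStates.get(region);
    }

    /**
     * Work done by a queued close handler. The region is moved to OFFLINE only
     * if it is still CLOSED; a handler that arrives late (for example after the
     * region was reopened) becomes a no-op and returns false.
     */
    boolean handleClosed(String region) {
        // replace(key, expected, updated) is an atomic compare-and-set on the map entry.
        return regionStates.replace(region, State.CLOSED, State.OFFLINE);
    }
}
```

With a check like this, a leftover handler for a region that is already OPEN would return false instead of forcing it offline and triggering another assignment.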

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
