hbase-issues mailing list archives

From "HBase Review Board (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-3068) IllegalStateException when new server comes online, is given 200 regions to open and 200th region gets timed out of regions in transition
Date Fri, 01 Oct 2010 21:28:33 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917042#action_12917042 ]

HBase Review Board commented on HBASE-3068:

Message from: stack@duboce.net

This is an automatically generated e-mail. To reply, visit:

(Updated 2010-10-01 14:05:45.726776)

Review request for hbase and Jonathan Gray.


Update to javadoc and comments


Fix is two-fold.

First, added a new facility: on a successful open, we update the timers of all other
regions in transition that are on the same server.
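
A minimal sketch of the timer update, not the actual AssignmentManager code. The class names (RegionEntry, TimerSketch), the field layout, and the method signatures are assumptions made for illustration; only the name updateTimers comes from the change description. When one region finishes opening, every region still in transition on the same server gets a fresh timestamp, so a server working through a long queue of opens is not timed out while it is visibly making progress.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the master's per-region transition record.
class RegionEntry {
    final String regionName;
    final String serverName;   // server the region was handed to
    volatile long timestamp;   // last time progress was seen for this region

    RegionEntry(String regionName, String serverName, long timestamp) {
        this.regionName = regionName;
        this.serverName = serverName;
        this.timestamp = timestamp;
    }
}

class TimerSketch {
    // Regions currently in transition, keyed by region name.
    final Map<String, RegionEntry> regionsInTransition = new ConcurrentHashMap<>();

    void add(RegionEntry entry) {
        regionsInTransition.put(entry.regionName, entry);
    }

    // Called when a region has opened successfully: drop it from the
    // in-transition set, then refresh its neighbours on the same server.
    void regionOpened(String regionName, long now) {
        RegionEntry opened = regionsInTransition.remove(regionName);
        if (opened == null) {
            return; // not tracked; nothing to refresh
        }
        updateTimers(opened.serverName, now);
    }

    // Reset the timestamp of every in-transition region on the given server.
    void updateTimers(String serverName, long now) {
        for (RegionEntry entry : regionsInTransition.values()) {
            if (entry.serverName.equals(serverName)) {
                entry.timestamp = now;
            }
        }
    }
}
```

Regions in transition on other servers keep their old timestamps, so a server that is genuinely stuck still times out.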

Second, the timeout monitor now does the necessary cleanup and state transitions so that
a region is in the proper state when we go to re-assign it.
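
A rough sketch of the timeout-monitor side, under stated assumptions: the State enum, the Transition record, and the methods assign and timeoutChore are hypothetical simplifications, and the real code also has to deal with ZooKeeper, which is left out here. The sketch only shows the ordering: a region stuck in PENDING_OPEN or OPENING is moved back to OFFLINE before it is handed to assign, so that assign does not trip over an unexpected state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TimeoutSketch {
    enum State { OFFLINE, PENDING_OPEN, OPENING, OPEN }

    // Hypothetical in-memory record of a region in transition.
    static class Transition {
        volatile State state;
        volatile long timestamp;

        Transition(State state, long timestamp) {
            this.state = state;
            this.timestamp = timestamp;
        }
    }

    final Map<String, Transition> regionsInTransition = new ConcurrentHashMap<>();
    final long timeoutMillis;

    TimeoutSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Assignment refuses to start from anything but OFFLINE, which is
    // the kind of check that produced the IllegalStateException.
    void assign(String region, long now) {
        Transition t = regionsInTransition.get(region);
        if (t == null || t.state != State.OFFLINE) {
            throw new IllegalStateException("Unexpected state trying to OFFLINE; " + region);
        }
        t.state = State.PENDING_OPEN;
        t.timestamp = now;
    }

    // One pass of the monitor; returns the regions it re-assigned.
    List<String> timeoutChore(long now) {
        List<String> reassigned = new ArrayList<>();
        for (Map.Entry<String, Transition> e : regionsInTransition.entrySet()) {
            Transition t = e.getValue();
            boolean opening = t.state == State.PENDING_OPEN || t.state == State.OPENING;
            if (opening && now - t.timestamp > timeoutMillis) {
                t.state = State.OFFLINE;   // cleanup before re-assign
                assign(e.getKey(), now);
                reassigned.add(e.getKey());
            }
        }
        return reassigned;
    }
}
```

Calling assign directly on a region that is still PENDING_OPEN throws, while going through timeoutChore resets the state first and succeeds.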

M src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
  Changed regionPlans to be a concurrent skip list.  This makes life easier,
  and nowhere do we need a lock on regionPlans to span anything other than
  changes to regionPlans itself.
  Added to the processing of a successful region open the cleanup
  of its regionPlan and a run of updateTimers.
  Replaced some code that duplicated setOffline with a call to setOffline.
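
To illustrate the regionPlans change, here is a small sketch that assumes the "concurrent skip list" is java.util.concurrent.ConcurrentSkipListMap keyed by region name. The Plan class and the planFor and regionOpened methods are hypothetical, not the AssignmentManager API. Each map operation is thread-safe on its own, so as long as no invariant spans regionPlans and another structure, no external lock is needed.

```java
import java.util.concurrent.ConcurrentSkipListMap;

class RegionPlansSketch {
    // Hypothetical plan record: where a region is meant to go.
    static class Plan {
        final String destinationServer;

        Plan(String destinationServer) {
            this.destinationServer = destinationServer;
        }
    }

    // Keyed by region name; no synchronized block guards this map.
    final ConcurrentSkipListMap<String, Plan> regionPlans = new ConcurrentSkipListMap<>();

    // Return the existing plan for a region, or atomically install a new one.
    Plan planFor(String region, String server) {
        Plan fresh = new Plan(server);
        Plan existing = regionPlans.putIfAbsent(region, fresh);
        return existing != null ? existing : fresh;
    }

    // On a successful open the plan is no longer needed.
    void regionOpened(String region) {
        regionPlans.remove(region);
    }
}
```

putIfAbsent does the check and the insert as one atomic step, which is the kind of operation that would otherwise need a lock around a plain sorted map.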

This addresses bug HBASE-3068.

Diffs (updated)

  trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1003330 

Diff: http://review.cloudera.org/r/930/diff


Basic unit tests seem to be passing.  Testing now up on cluster.



> IllegalStateException when new server comes online, is given 200 regions to open and 200th region gets timed out of regions in transition
> -----------------------------------------------------------------------------------------------------------------------------------------
>                 Key: HBASE-3068
>                 URL: https://issues.apache.org/jira/browse/HBASE-3068
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>             Fix For: 0.90.0
> Yesterday we committed a change that makes the master crash if a ZK transition is unexpected.  It's extreme, but good for highlighting bad state changes (we also started marking these as IllegalStateExceptions yesterday).
> So, testing the new master, I brought up a new server.  The balancer tried to give the new server 256 regions.
> {code}
> 2010-10-01 16:01:42,972 INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in 0ms. Moving 256 regions off of 7 overloaded servers onto 1 less loaded servers
> {code}
> Turns out we failed to complete the open of all 256 regions within the regions-in-transition timeout period, so we tried to reassign.  The master aborted because the region was in the PENDING_OPEN state when we went about assigning.
> {code}
> 2010-10-01 16:02:28,809 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  usertable,user1128734802,1285701924906.006696a9bf346f8593df66728e18e029. state=PENDING_OPEN, ts=1285948921051
> 2010-10-01 16:02:28,809 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN or OPENING for too long, reassigning region=usertable,user1128734802,1285701924906.006696a9bf346f8593df66728e18e029.
> 2010-10-01 16:02:28,811 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state trying to OFFLINE; usertable,user1128734802,1285701924906.006696a9bf346f8593df66728e18e029. state=PENDING_OPEN, ts=1285948921051
> java.lang.IllegalStateException
>     at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:662)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:632)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:560)
>     at org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor.chore(AssignmentManager.java:1102)
>     at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
> {code}

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.