hbase-dev mailing list archives

From "Jonathan Gray" <jg...@apache.org>
Subject Re: Review Request: HBASE-2700 Unit test of master failover while regions in transition
Date Fri, 08 Oct 2010 21:02:29 GMT


> On 2010-10-08 13:43:59, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line 267
> > <http://review.cloudera.org/r/995/diff/1/?file=14445#file14445line267>
> >
> >     When would this happen?

   * <b>ZK State:  OFFLINE</b>
   * <p>A node can get into OFFLINE state if</p>
   * <ul>
   * <li>An RS fails to open a region, so it reverts the state back to OFFLINE
   * <li>The Master is assigning the region to a RS before it sends RPC
   * </ul>
   * <p>We will mock the scenarios</p>
   * <ul>
   * <li>Master has assigned an enabled region but RS failed so a region is
   *     not assigned anywhere and is sitting in ZK as OFFLINE</li>
   * <li>This seems to cover both cases?</li>
   * </ul>
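
To make the two paths into OFFLINE concrete, here is a minimal sketch. It uses a hypothetical in-memory map in place of the real ZK nodes; the class and method names (OfflineStateSketch, masterBeginsAssign, rsBeginsOpen, rsFailsOpen) are made up for illustration and are not part of AssignmentManager or ZKAssign.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the region transition nodes in ZK.
// Not HBase code: it only models the two ways a node ends up OFFLINE.
class OfflineStateSketch {
    enum State { OFFLINE, OPENING, OPENED, CLOSING, CLOSED }

    // Region name -> state recorded in its (mock) transition node.
    private final Map<String, State> nodes = new HashMap<>();

    // Path 1: the master creates the node as OFFLINE before sending the open RPC.
    void masterBeginsAssign(String region) {
        nodes.put(region, State.OFFLINE);
    }

    // The RS picks up the region and starts opening it.
    void rsBeginsOpen(String region) {
        require(region, State.OFFLINE);
        nodes.put(region, State.OPENING);
    }

    // Path 2: the open fails, so the RS reverts the node back to OFFLINE.
    void rsFailsOpen(String region) {
        require(region, State.OPENING);
        nodes.put(region, State.OFFLINE);
    }

    State stateOf(String region) {
        return nodes.get(region);
    }

    private void require(String region, State expected) {
        State actual = nodes.get(region);
        if (actual != expected) {
            throw new IllegalStateException(
                region + " is " + actual + ", expected " + expected);
        }
    }

    public static void main(String[] args) {
        OfflineStateSketch zk = new OfflineStateSketch();
        zk.masterBeginsAssign("regionA");   // master stops here: node stays OFFLINE
        zk.masterBeginsAssign("regionB");
        zk.rsBeginsOpen("regionB");
        zk.rsFailsOpen("regionB");          // RS failed the open: node is OFFLINE again
        System.out.println(zk.stateOf("regionA") + " " + zk.stateOf("regionB"));
    }
}
```

Both regions end up looking the same from ZK's point of view, which is why a single mocked scenario may be enough to cover both cases.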


> On 2010-10-08 13:43:59, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java, line 675
> > <http://review.cloudera.org/r/995/diff/1/?file=14448#file14448line675>
> >
> >     Don't we have this in AssignmentManager already?
> >     isRegionsInTransition I believe its called.
> >     
> >     There is white space added at end of the two @throws lines.

This tests ZK, not the RIT (regions-in-transition) map on the master, so for unit tests
you're testing two different things.  Since I'm mocking data up in ZK, I wanted to ensure
nothing is left in ZK.
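
A rough sketch of why the two checks differ, using a hypothetical model rather than the real AssignmentManager or ZKAssign calls (ZkVersusRitSketch and its fields are invented for illustration). A node written directly into ZK by the test is invisible to a check that only looks at the master's in-memory map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the two views a test can check; not the HBase API.
class ZkVersusRitSketch {
    // View 1: the master's in-memory regions-in-transition map (region -> state).
    final Map<String, String> masterRit = new HashMap<>();
    // View 2: transition nodes that exist in ZK (mocked here as a set of names).
    final Set<String> zkNodes = new TreeSet<>();

    boolean masterSeesNoTransitions() {
        return masterRit.isEmpty();
    }

    boolean nothingLeftInZk() {
        return zkNodes.isEmpty();
    }

    public static void main(String[] args) {
        ZkVersusRitSketch cluster = new ZkVersusRitSketch();
        // The test writes a node straight into ZK; the master never tracked it.
        cluster.zkNodes.add("enabledTable,region1");
        System.out.println("master RIT empty: " + cluster.masterSeesNoTransitions());
        System.out.println("ZK empty: " + cluster.nothingLeftInZk());
        // Assumption for this sketch: handling the node ends with its removal from ZK.
        cluster.zkNodes.remove("enabledTable,region1");
        System.out.println("ZK empty after cleanup: " + cluster.nothingLeftInZk());
    }
}
```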


> On 2010-10-08 13:43:59, stack wrote:
> > trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java, line 462
> > <http://review.cloudera.org/r/995/diff/1/?file=14451#file14451line462>
> >
> >     What about the case where not all regions have been assigned -- say the master
> >     was killed mid-startup before all regions mentioned in .META. had been assigned
> >     by master?  There should be a fixup where we compare the difference?  Can we even
> >     handle this case?  We'd need to ask RSs what they are holding?

IMO we don't need to support this (for now).  I think it is acceptable to require that
nothing fails during startup.  If the master dies or an RS dies during initial startup, you
have to restart.  RS deaths may even work fine, but I think it's okay to have a SPOF during
startup.
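
For reference, the comparison being asked about could look roughly like the sketch below. It is not something this patch implements. It assumes the regions listed in .META. and the regions each RS reports holding are already available as plain sets, and StartupFixupSketch and unassigned are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the "compare the difference" fixup; not in the patch.
class StartupFixupSketch {
    // Regions listed in .META. that no regionserver reports holding.
    static Set<String> unassigned(Set<String> regionsInMeta,
                                  Map<String, Set<String>> heldByServer) {
        Set<String> missing = new TreeSet<>(regionsInMeta);
        for (Set<String> held : heldByServer.values()) {
            missing.removeAll(held);
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> meta = new HashSet<>(Arrays.asList("r1", "r2", "r3", "r4"));
        Map<String, Set<String>> held = new HashMap<>();
        held.put("rs1", new HashSet<>(Arrays.asList("r1")));
        held.put("rs2", new HashSet<>(Arrays.asList("r3")));
        // Regions the master had not assigned before it was killed.
        System.out.println(unassigned(meta, held));
    }
}
```

The hard part is not the set difference but getting a trustworthy answer from every RS, which is why requiring a clean startup seems fine for now.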


- Jonathan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/995/#review1496
-----------------------------------------------------------


On 2010-10-07 16:34:04, Jonathan Gray wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://review.cloudera.org/r/995/
> -----------------------------------------------------------
> 
> (Updated 2010-10-07 16:34:04)
> 
> 
> Review request for hbase and stack.
> 
> 
> Summary
> -------
> 
> First go at a unit test of master failover with regions in transition.
> 
> Comment from the test method:
> 
>   /**
>    * Complex test of master failover that tests as many permutations of the
>    * different possible states that regions in transition could be in within ZK.
>    * <p>
>    * This tests the proper handling of these states by the failed-over master
>    * and includes a thorough testing of the timeout code as well.
>    * <p>
>    * Starts with a single master and three regionservers.
>    * <p>
>    * Creates two tables, enabledTable and disabledTable, each containing 5
>    * regions.  The disabledTable is then disabled.
>    * <p>
>    * After reaching steady-state, the master is killed.  We then mock several
>    * states in ZK.
>    * <p>
>    * After mocking them, we will startup a new master which should become the
>    * active master and also detect that it is a failover.  The primary test
>    * passing condition will be that all regions of the enabled table are
>    * assigned and all the regions of the disabled table are not assigned.
>    * <p>
>    * The different scenarios to be tested are below:
>    * <p>
>    * <b>ZK State:  OFFLINE</b>
>    * <p>A node can get into OFFLINE state if</p>
>    * <ul>
>    * <li>An RS fails to open a region, so it reverts the state back to OFFLINE
>    * <li>The Master is assigning the region to a RS before it sends RPC
>    * </ul>
>    * <p>We will mock the scenarios</p>
>    * <ul>
>    * <li>Master has assigned an enabled region but RS failed so a region is
>    *     not assigned anywhere and is sitting in ZK as OFFLINE</li>
>    * <li>This seems to cover both cases?</li>
>    * </ul>
>    * <p>
>    * <b>ZK State:  CLOSING</b>
>    * <p>A node can get into CLOSING state if</p>
>    * <ul>
>    * <li>An RS has begun to close a region
>    * </ul>
>    * <p>We will mock the scenarios</p>
>    * <ul>
>    * <li>Region was being closed but the RS died before finishing the close
>    * <li>Region of enabled table was being closed but did not complete
>    * <li>Region of disabled table was being closed but did not complete
>    * </ul>
>    * <p>
>    * <b>ZK State:  CLOSED</b>
>    * <p>A node can get into CLOSED state if</p>
>    * <ul>
>    * <li>An RS has completed closing a region but not acknowledged by master yet
>    * </ul>
>    * <p>We will mock the scenarios</p>
>    * <ul>
>    * <li>Region of a table that should be enabled was closed on an RS
>    * <li>Region of a table that should be disabled was closed on an RS
>    * </ul>
>    * <p>
>    * <b>ZK State:  OPENING</b>
>    * <p>A node can get into OPENING state if</p>
>    * <ul>
>    * <li>An RS has begun to open a region
>    * </ul>
>    * <p>We will mock the scenarios</p>
>    * <ul>
>    * <li>RS was opening a region of enabled table but never finishes
>    * </ul>
>    * <p>
>    * <b>ZK State:  OPENED</b>
>    * <p>A node can get into OPENED state if</p>
>    * <ul>
>    * <li>An RS has finished opening a region but not acknowledged by master yet
>    * </ul>
>    * <p>We will mock the scenarios</p>
>    * <ul>
>    * <li>Region of a table that should be enabled was opened on an RS
>    * <li>Region of a table that should be disabled was opened on an RS
>    * <li>Region of a table that should be enabled was opened by a now-dead RS
>    * <li>Region of a table that should be disabled was opened by a now-dead RS
>    * </ul>
>    * <p>
>    * <b>ZK State:  NONE</b>
>    * <p>A region could not have a transition node if</p>
>    * <ul>
>    * <li>The server hosting the region died and no master processed it
>    * </ul>
>    * <p>We will mock the scenarios</p>
>    * <ul>
>    * <li>Region of enabled table was on a dead RS that was not yet processed
>    * <li>Region of disabled table was on a dead RS that was not yet processed
>    * </ul>
>    * @throws Exception
>    */
> 
> 
> This addresses bug HBASE-2700.
>     http://issues.apache.org/jira/browse/HBASE-2700
> 
> 
> Diffs
> -----
> 
>   trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1005264 
>   trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1005264 
>   trunk/src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java 1005264 
>   trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1005264 
>   trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java 1005264 
>   trunk/src/test/java/org/apache/hadoop/hbase/MiniHBaseCluster.java 1005264 
>   trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java 1005264
> 
> Diff: http://review.cloudera.org/r/995/diff
> 
> 
> Testing
> -------
> 
> running the unit test!
> 
> 
> Thanks,
> 
> Jonathan
> 
>

