hbase-issues mailing list archives

From "Zhihong Ted Yu (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
Date Fri, 20 Jul 2012 02:55:36 GMT

    [ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418866#comment-13418866 ]

Zhihong Ted Yu edited comment on HBASE-6389 at 7/20/12 2:53 AM:
----------------------------------------------------------------

I ran the test suite with the latest patch on trunk and got:
{code}
Failed tests:   testRunThriftServer[12](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:<1> but was:<0>
  testRunThriftServer[14](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:<1> but was:<0>
  testRunThriftServer[15](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:<1> but was:<0>
  testRunThriftServer[16](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:<1> but was:<0>
  testRunThriftServer[17](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:<1> but was:<0>

Tests in error:
  testRegionCaching(org.apache.hadoop.hbase.client.TestHCM): org.apache.hadoop.hbase.UnknownRegionException: bd992463917ba68fe5389c5bf9e94a3a
  testCloseRegionThatFetchesTheHRIFromMeta(org.apache.hadoop.hbase.client.TestAdmin): -1
  testTableExists(org.apache.hadoop.hbase.catalog.TestMetaReaderEditor): org.apache.hadoop.hbase.TableNotEnabledException: testTableExists
  testRunThriftServer[11](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): test timed out after 60000 milliseconds
  testRunThriftServer[13](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): test timed out after 60000 milliseconds
{code}
There was one hanging test:
{code}
	at org.apache.hadoop.hbase.replication.TestReplication.setUp(TestReplication.java:183)
{code}

BTW, what do *R*~i~, C and *F*~i~ represent in the formula above?
                
      was (Author: zhihyu@ebaysf.com):
    I ran the test suite with the latest patch on trunk and got:
{code}
Running org.apache.hadoop.hbase.client.TestHCM
Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 37.265 sec <<< FAILURE!
--
Running org.apache.hadoop.hbase.client.TestAdmin
Tests run: 40, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 322.872 sec <<< FAILURE!
--
Running org.apache.hadoop.hbase.catalog.TestMetaReaderEditor
Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 134.193 sec <<< FAILURE!
--
Running org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine
Tests run: 20, Failures: 5, Errors: 2, Skipped: 0, Time elapsed: 669.588 sec <<< FAILURE!
{code}
There was one hanging test:
{code}
	at org.apache.hadoop.hbase.replication.TestReplication.setUp(TestReplication.java:183)
{code}

BTW, what do *R*~i~, C and *F*~i~ represent in the formula above?
                  
> Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-6389
>                 URL: https://issues.apache.org/jira/browse/HBASE-6389
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.94.0, 0.96.0
>            Reporter: Aditya Kishore
>            Assignee: Aditya Kishore
>            Priority: Critical
>             Fix For: 0.96.0, 0.94.2
>
>         Attachments: HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, org.apache.hadoop.hbase.TestZooKeeper-output.txt, testReplication.jstack
>
>
> Continuing from HBASE-6375.
> It seems I was mistaken in my assumption that changing the value of "hbase.master.wait.on.regionservers.mintostart" to a sufficient number (from the default of 1) can help prevent assignment of all regions to one (or a small number of) region server(s).
> While this was the case in 0.90.x and 0.92.x, the behavior changed from 0.94.0 onwards to address HBASE-4993.
> From 0.94.0 onwards, the Master proceeds immediately after the timeout has elapsed, even if "hbase.master.wait.on.regionservers.mintostart" has not been reached.
> Reading the current conditions of waitForRegionServers() clarifies this:
> {code:title=ServerManager.java (trunk rev:1360470)}
> ....
>   /**
>    * Wait for the region servers to report in.
>    * We will wait until one of this condition is met:
>    *  - the master is stopped
>    *  - the 'hbase.master.wait.on.regionservers.timeout' is reached
>    *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
>    *    region servers is reached
>    *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
>    *   there have been no new region server in for
>    *      'hbase.master.wait.on.regionservers.interval' time
>    *
>    * @throws InterruptedException
>    */
>   public void waitForRegionServers(MonitoredTask status)
>   throws InterruptedException {
> ....
> ....
>     while (
>       !this.master.isStopped() &&
>         slept < timeout &&
>         count < maxToStart &&
>         (lastCountChange+interval > now || count < minToStart)
>       ){
> ....
> {code}
> So with the current conditions, the wait ends as soon as the timeout is reached, even if fewer RSes than "hbase.master.wait.on.regionservers.mintostart" have checked in with the Master, and the Master proceeds with region assignment among these RSes alone.
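
To make the early exit concrete, here is a minimal sketch that lifts the quoted loop condition into a standalone predicate. The class name, the method signature and the numeric values are hypothetical and chosen only for illustration; only the boolean expression is taken from the quoted code.

```java
// Hypothetical standalone harness; not part of HBase.
class CurrentWaitCondition {
    // Mirrors the loop condition quoted above from ServerManager.java.
    // Returns true while the master would keep waiting.
    static boolean keepWaiting(boolean stopped, long slept, long timeout,
                               int count, int minToStart, int maxToStart,
                               long lastCountChange, long interval, long now) {
        return !stopped
            && slept < timeout
            && count < maxToStart
            && (lastCountChange + interval > now || count < minToStart);
    }

    public static void main(String[] args) {
        // Assumed values: 2 of a required 10 region servers have checked in.
        long timeout = 4500, interval = 1500;
        int count = 2, minToStart = 10, maxToStart = Integer.MAX_VALUE;

        // Before the timeout: still waiting because count < minToStart.
        System.out.println(keepWaiting(false, 3000, timeout, count,
            minToStart, maxToStart, 0, interval, 3000)); // prints true

        // At the timeout: "slept < timeout" is false, so the wait ends
        // even though count < minToStart.
        System.out.println(keepWaiting(false, 4500, timeout, count,
            minToStart, maxToStart, 0, interval, 4500)); // prints false
    }
}
```

Because `slept < timeout` is ANDed with the other terms, it ends the wait on its own once the timeout has elapsed, regardless of `minToStart`.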
> As mentioned in -[HBASE-4993|https://issues.apache.org/jira/browse/HBASE-4993?focusedCommentId=13237196#comment-13237196]-, and I concur, this could have a disastrous effect in a large cluster, especially now that MSLAB is turned on.
> To enforce the required quorum as specified by "hbase.master.wait.on.regionservers.mintostart" irrespective of the timeout, these conditions need to be modified as follows:
> {code:title=ServerManager.java}
> ..
>   /**
>    * Wait for the region servers to report in.
>    * We will wait until one of this condition is met:
>    *  - the master is stopped
>    *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
>    *    region servers is reached
>    *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
>    *   there have been no new region server in for
>    *      'hbase.master.wait.on.regionservers.interval' time AND
>    *   the 'hbase.master.wait.on.regionservers.timeout' is reached
>    *
>    * @throws InterruptedException
>    */
>   public void waitForRegionServers(MonitoredTask status)
> ..
> ..
>     int minToStart = this.master.getConfiguration().
>     getInt("hbase.master.wait.on.regionservers.mintostart", 1);
>     int maxToStart = this.master.getConfiguration().
>     getInt("hbase.master.wait.on.regionservers.maxtostart", Integer.MAX_VALUE);
>     if (maxToStart < minToStart) {
>       maxToStart = minToStart;
>     }
> ..
> ..
>     while (
>       !this.master.isStopped() &&
>         count < maxToStart &&
>         (lastCountChange+interval > now || timeout > slept || count < minToStart)
>       ){
> ..
> {code}
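
The effect of the proposed condition can be explored with a small simulation. This is only a sketch under stated assumptions: the class, the `simulate` helper, the simulated clock and the check-in times are all hypothetical, and the real loop in ServerManager also sleeps and logs; only the boolean expression is taken from the proposal above.

```java
// Hypothetical simulation harness; not part of HBase.
class ProposedWaitSimulation {
    // Proposed condition: keep waiting while the master is running,
    // count < maxToStart, and any of the three sub-conditions still holds.
    static boolean keepWaiting(boolean stopped, long slept, long timeout,
                               int count, int minToStart, int maxToStart,
                               long lastCountChange, long interval, long now) {
        return !stopped
            && count < maxToStart
            && (lastCountChange + interval > now || timeout > slept || count < minToStart);
    }

    // checkIns[i] is the assumed time (ms) at which region server i reports in.
    // Returns the simulated time at which the wait ends. Note that this never
    // returns if fewer than minToStart servers ever check in.
    static long simulate(long[] checkIns, int minToStart, int maxToStart,
                         long timeout, long interval, long step) {
        long now = 0, slept = 0, lastCountChange = 0;
        int count = 0;
        while (keepWaiting(false, slept, timeout, count, minToStart, maxToStart,
                           lastCountChange, interval, now)) {
            now += step;
            slept += step;
            int newCount = 0;
            for (long t : checkIns) {
                if (t <= now) newCount++;
            }
            if (newCount != count) {
                count = newCount;
                lastCountChange = now;
            }
        }
        return now;
    }

    public static void main(String[] args) {
        // The third server arrives after the 4500 ms timeout; the wait continues
        // until it checks in and the count has been stable for one interval.
        System.out.println(simulate(new long[] {100, 200, 6000}, 3,
            Integer.MAX_VALUE, 4500, 1500, 100));
    }
}
```

One consequence worth noting: under this condition the timeout acts as a minimum wait, so even if all `minToStart` servers check in early, the loop keeps waiting until the timeout has elapsed (unless `maxToStart` is reached first).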

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
