Return-Path:
X-Original-To: apmail-hbase-issues-archive@www.apache.org
Delivered-To: apmail-hbase-issues-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 195DCDA3F for ; Thu, 18 Oct 2012 22:50:04 +0000 (UTC)
Received: (qmail 88177 invoked by uid 500); 18 Oct 2012 22:50:03 -0000
Delivered-To: apmail-hbase-issues-archive@hbase.apache.org
Received: (qmail 88135 invoked by uid 500); 18 Oct 2012 22:50:03 -0000
Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Delivered-To: mailing list issues@hbase.apache.org
Received: (qmail 88125 invoked by uid 99); 18 Oct 2012 22:50:03 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 18 Oct 2012 22:50:03 +0000
Date: Thu, 18 Oct 2012 22:50:03 +0000 (UTC)
From: "Hadoop QA (JIRA)"
To: issues@hbase.apache.org
Message-ID: <1955592332.66349.1350600603829.JavaMail.jiratomcat@arcas>
In-Reply-To: <1009162098.45767.1342141536580.JavaMail.jiratomcat@issues-vm>
Subject: [jira] [Commented] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479443#comment-13479443 ]

Hadoop QA commented on HBASE-6389:
----------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12549763/HBASE-6389_trunk_v2.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 15 new or modified tests.

{color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile.

{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 82 warning messages.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:red}-1 findbugs{color}. The patch appears to introduce 4 new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:red}-1 core tests{color}. The patch failed these unit tests:
org.apache.hadoop.hbase.backup.example.TestZooKeeperTableArchiveClient

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//console

This message is automatically generated.

> Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
> ----------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-6389
> URL: https://issues.apache.org/jira/browse/HBASE-6389
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.94.0, 0.96.0
> Reporter: Aditya Kishore
> Assignee: Aditya Kishore
> Priority: Critical
> Fix For: 0.96.0
>
> Attachments: HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk_v2.patch, HBASE-6389_trunk_v2.patch, org.apache.hadoop.hbase.TestZooKeeper-output.txt, testReplication.jstack
>
>
> Continuing from HBASE-6375.
> It seems I was mistaken in my assumption that changing the value of "hbase.master.wait.on.regionservers.mintostart" to a sufficient number (from the default of 1) can help prevent assignment of all regions to one (or a small number of) region server(s).
> While this was the case in 0.90.x and 0.92.x, the behavior changed from 0.94.0 onwards to address HBASE-4993.
> From 0.94.0 onwards, the Master proceeds immediately after the timeout has lapsed, even if "hbase.master.wait.on.regionservers.mintostart" has not been reached.
> Reading the current conditions of waitForRegionServers() makes this clear:
> {code:title=ServerManager.java (trunk rev:1360470)}
> ....
>   /**
>    * Wait for the region servers to report in.
>    * We will wait until one of this condition is met:
>    *  - the master is stopped
>    *  - the 'hbase.master.wait.on.regionservers.timeout' is reached
>    *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
>    *    region servers is reached
>    *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
>    *    there have been no new region server in for
>    *    'hbase.master.wait.on.regionservers.interval' time
>    *
>    * @throws InterruptedException
>    */
>   public void waitForRegionServers(MonitoredTask status)
>   throws InterruptedException {
> ....
> ....
>     while (
>       !this.master.isStopped() &&
>         slept < timeout &&
>         count < maxToStart &&
>         (lastCountChange+interval > now || count < minToStart)
>       ){
> ....
> {code}
> So with the current conditions, the wait ends as soon as the timeout is reached, even if fewer RSes than required have checked in with the Master, and the Master proceeds with region assignment among these RSes alone.
> As mentioned in -[HBASE-4993|https://issues.apache.org/jira/browse/HBASE-4993?focusedCommentId=13237196#comment-13237196]-, and I concur, this could have a disastrous effect in a large cluster, especially now that MSLAB is turned on.
> To enforce the required quorum as specified by "hbase.master.wait.on.regionservers.mintostart" irrespective of the timeout, these conditions need to be modified as follows:
> {code:title=ServerManager.java}
> ..
>   /**
>    * Wait for the region servers to report in.
>    * We will wait until one of this condition is met:
>    *  - the master is stopped
>    *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
>    *    region servers is reached
>    *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
>    *    there have been no new region server in for
>    *    'hbase.master.wait.on.regionservers.interval' time AND
>    *    the 'hbase.master.wait.on.regionservers.timeout' is reached
>    *
>    * @throws InterruptedException
>    */
>   public void waitForRegionServers(MonitoredTask status)
> ..
> ..
>     int minToStart = this.master.getConfiguration().
>         getInt("hbase.master.wait.on.regionservers.mintostart", 1);
>     int maxToStart = this.master.getConfiguration().
>         getInt("hbase.master.wait.on.regionservers.maxtostart", Integer.MAX_VALUE);
>     if (maxToStart < minToStart) {
>       maxToStart = minToStart;
>     }
> ..
> ..
>     while (
>       !this.master.isStopped() &&
>         count < maxToStart &&
>         (lastCountChange+interval > now || timeout > slept || count < minToStart)
>       ){
> ..
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
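
The difference between the two loop conditions quoted in the issue description can be checked in isolation. The following is only a minimal sketch: the class name WaitConditionSketch, the two method names, and the scenario values in main are hypothetical and are not part of ServerManager or of the attached patches. Only the two boolean expressions are copied from the description.

```java
// Hypothetical stand-alone sketch; not code from HBase or from the patch.
class WaitConditionSketch {

    /** Loop condition quoted from trunk: the wait ends once the timeout is reached. */
    public static boolean keepWaitingCurrent(boolean stopped, long slept, long timeout,
            int count, int minToStart, int maxToStart,
            long lastCountChange, long interval, long now) {
        return !stopped
            && slept < timeout
            && count < maxToStart
            && (lastCountChange + interval > now || count < minToStart);
    }

    /** Loop condition proposed in the issue: the timeout alone no longer ends the wait. */
    public static boolean keepWaitingProposed(boolean stopped, long slept, long timeout,
            int count, int minToStart, int maxToStart,
            long lastCountChange, long interval, long now) {
        return !stopped
            && count < maxToStart
            && (lastCountChange + interval > now || timeout > slept || count < minToStart);
    }

    public static void main(String[] args) {
        // Assumed scenario: the timeout has elapsed, the region server count has been
        // stable for longer than the interval, and only 1 of the 3 required region
        // servers has reported in.
        long timeout = 4500, slept = 5000, interval = 1500, lastCountChange = 0, now = 5000;
        int count = 1, minToStart = 3, maxToStart = Integer.MAX_VALUE;

        System.out.println("current keeps waiting:  " + keepWaitingCurrent(
            false, slept, timeout, count, minToStart, maxToStart, lastCountChange, interval, now));
        System.out.println("proposed keeps waiting: " + keepWaitingProposed(
            false, slept, timeout, count, minToStart, maxToStart, lastCountChange, interval, now));
    }
}
```

In this scenario the current condition evaluates to false, so the Master would start assigning regions to the single region server, while the proposed condition evaluates to true and the Master would keep waiting for the quorum.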