Date: Wed, 31 Oct 2012 09:35:13 +0000 (UTC)
From: "Aditya Kishore (JIRA)"
To: issues@hbase.apache.org
Message-ID: <460794992.49652.1351676113633.JavaMail.jiratomcat@arcas>
In-Reply-To: <1009162098.45767.1342141536580.JavaMail.jiratomcat@issues-vm>
Subject: [jira] [Updated] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments

     [ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aditya Kishore updated HBASE-6389:
----------------------------------
    Attachment: HBASE-6389_0.94.patch

Attaching patch for 0.94 branch. The patch passes the full test suite on my test machine.

> Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-6389
>                 URL: https://issues.apache.org/jira/browse/HBASE-6389
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.94.0, 0.96.0
>            Reporter: Aditya Kishore
>            Assignee: Aditya Kishore
>            Priority: Critical
>             Fix For: 0.94.3, 0.96.0
>
>         Attachments: HBASE-6389_0.94.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk_v2.patch, HBASE-6389_trunk_v2.patch, org.apache.hadoop.hbase.TestZooKeeper-output.txt, testReplication.jstack
>
>
> Continuing from HBASE-6375.
> It seems I was mistaken in my assumption that changing the value of "hbase.master.wait.on.regionservers.mintostart" to a sufficient number (from the default of 1) can help prevent assignment of all regions to one (or a small number of) region server(s).
> While this was the case in 0.90.x and 0.92.x, the behavior has changed from 0.94.0 onwards to address HBASE-4993.
> From 0.94.0 onwards, the Master will proceed immediately after the timeout has lapsed, even if "hbase.master.wait.on.regionservers.mintostart" has not been reached.
> Reading the current conditions of waitForRegionServers() clarifies this:
> {code:title=ServerManager.java (trunk rev:1360470)}
> ....
> 581   /**
> 582    * Wait for the region servers to report in.
> 583    * We will wait until one of this condition is met:
> 584    *  - the master is stopped
> 585    *  - the 'hbase.master.wait.on.regionservers.timeout' is reached
> 586    *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
> 587    *    region servers is reached
> 588    *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
> 589    *   there have been no new region server in for
> 590    *      'hbase.master.wait.on.regionservers.interval' time
> 591    *
> 592    * @throws InterruptedException
> 593    */
> 594   public void waitForRegionServers(MonitoredTask status)
> 595   throws InterruptedException {
> ....
> ....
> 612     while (
> 613       !this.master.isStopped() &&
> 614         slept < timeout &&
> 615         count < maxToStart &&
> 616         (lastCountChange+interval > now || count < minToStart)
> 617       ){
> ....
> {code}
> So with the current conditions, the wait will end as soon as the timeout is reached, even if fewer RSes than required have checked in with the Master, and the Master will proceed with region assignment among these RSes alone.
> As mentioned in -[HBASE-4993|https://issues.apache.org/jira/browse/HBASE-4993?focusedCommentId=13237196#comment-13237196]-, and I concur, this could have a disastrous effect in a large cluster, especially now that MSLAB is turned on.
> To enforce the required quorum as specified by "hbase.master.wait.on.regionservers.mintostart" irrespective of the timeout, these conditions need to be modified as follows:
> {code:title=ServerManager.java}
> ..
>   /**
>    * Wait for the region servers to report in.
>    * We will wait until one of this condition is met:
>    *  - the master is stopped
>    *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
>    *    region servers is reached
>    *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
>    *   there have been no new region server in for
>    *      'hbase.master.wait.on.regionservers.interval' time AND
>    *   the 'hbase.master.wait.on.regionservers.timeout' is reached
>    *
>    * @throws InterruptedException
>    */
>   public void waitForRegionServers(MonitoredTask status)
> ..
> ..
>     int minToStart = this.master.getConfiguration().
>     getInt("hbase.master.wait.on.regionservers.mintostart", 1);
>     int maxToStart = this.master.getConfiguration().
>     getInt("hbase.master.wait.on.regionservers.maxtostart", Integer.MAX_VALUE);
>     if (maxToStart < minToStart) {
>       maxToStart = minToStart;
>     }
> ..
> ..
>     while (
>       !this.master.isStopped() &&
>         count < maxToStart &&
>         (lastCountChange+interval > now || timeout > slept || count < minToStart)
>       ){
> ..
> {code}
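
To make the difference between the two loop conditions concrete, here is a minimal stand-alone sketch that evaluates both predicates on the same inputs. It is not the actual ServerManager code: the class name WaitConditionSketch, the two method names, and the sample values in main are hypothetical and chosen only for illustration. The boolean expressions themselves are copied from the two snippets quoted in the issue.

```java
// Hypothetical stand-alone sketch; NOT the real ServerManager implementation.
// Only the two boolean expressions are taken from the snippets in the issue.
final class WaitConditionSketch {

    /** Loop condition quoted from trunk: the timeout alone can end the wait. */
    static boolean keepWaitingCurrent(boolean stopped, long slept, long timeout,
                                      int count, int minToStart, int maxToStart,
                                      long lastCountChange, long interval, long now) {
        return !stopped
            && slept < timeout
            && count < maxToStart
            && (lastCountChange + interval > now || count < minToStart);
    }

    /** Proposed loop condition: the wait cannot end before minToStart is reached. */
    static boolean keepWaitingProposed(boolean stopped, long slept, long timeout,
                                       int count, int minToStart, int maxToStart,
                                       long lastCountChange, long interval, long now) {
        return !stopped
            && count < maxToStart
            && (lastCountChange + interval > now || timeout > slept || count < minToStart);
    }

    public static void main(String[] args) {
        // Illustrative values (assumed, in milliseconds): the timeout has elapsed,
        // no region server has checked in recently, and only 1 of the 3 required
        // region servers has reported to the master.
        long timeout = 4500L, slept = 5000L, interval = 1500L;
        long lastCountChange = 0L, now = 5000L;
        int count = 1, minToStart = 3, maxToStart = Integer.MAX_VALUE;

        System.out.println("current condition keeps waiting:  "
            + keepWaitingCurrent(false, slept, timeout, count, minToStart,
                                 maxToStart, lastCountChange, interval, now));
        System.out.println("proposed condition keeps waiting: "
            + keepWaitingProposed(false, slept, timeout, count, minToStart,
                                  maxToStart, lastCountChange, interval, now));
    }
}
```

With these sample inputs the current condition evaluates to false (the master stops waiting with a single region server), while the proposed condition evaluates to true and keeps waiting until at least minToStart region servers have checked in.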