From: "stack (JIRA)"
To: hbase-dev@hadoop.apache.org
Date: Tue, 13 Jan 2009 10:58:03 -0800 (PST)
Subject: [jira] Commented: (HBASE-1124) Balancer kicks in way too early
Message-ID: <1399169525.1231873083192.JavaMail.jira@brutus>
In-Reply-To: <1726092921.1231832339934.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/HBASE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663421#action_12663421 ]

stack commented on HBASE-1124:
------------------------------

Looking at Andrew's
logs, you're both 'right'. Yes, the balancer doesn't cut in until all regions are assigned; it's just that on a big cluster there is a big gap between all assigned and all open. In this gap, I see the balancer cutting in in Andrew's log. We don't want it working here while all regionservers have a big queue of region opens that they are currently working on.

Here is an example. All regions have been handed out and the master is just waiting on the opens to come in.

{code}
....
009-01-13 06:57:09,006 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_PROCESS_OPEN: result_domain,com.chawlk,1231796870012 from XX.XX.XX.37:60020
2009-01-13 06:57:09,006 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_PROCESS_OPEN: content,28e2ec17934b05f11a77a88b1528d905,1231822159077 from XX.XX.XX.37:60020
2009-01-13 06:57:09,006 DEBUG org.apache.hadoop.hbase.master.RegionManager: Server 10.30.94.37:60020 is overloaded. Server load: 26 avg: 21.0, slop: 0.2
2009-01-13 06:57:09,006 DEBUG org.apache.hadoop.hbase.master.RegionManager: Choosing to reassign 5 regions. mostLoadedRegions has 10 regions in it.
2009-01-13 06:57:09,006 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close region content,afebbf5e615585830ebe6f74e1014f3d,1231766212960
2009-01-13 06:57:09,006 INFO org.apache.hadoop.hbase.master.RegionManager: Skipped 9 region(s) that are in transition states
...
{code}

Above we are closing 'content,afebbf5e615585830ebe6f74e1014f3d,1231766212960', which had just opened 3 seconds earlier. About 5% of all regions assigned have reported back as opened. We shouldn't be balancing at this time.

> Balancer kicks in way too early
> -------------------------------
>
>                 Key: HBASE-1124
>                 URL: https://issues.apache.org/jira/browse/HBASE-1124
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>             Fix For: 0.19.0
>
>
> Balancer kicks in before all regions are assigned out. Causes confusion. Master won't accept OPENs from "overloaded" HRS.
> Master is slow to respond to UI and HRS during this. Master sometimes takes too long to respond to an HRS heartbeat and so the HRS will reinit. This causes more confusion.
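
To make the numbers in the log excerpt concrete, here is a minimal sketch of the overload check and of a gate that would hold the balancer off while opens are still pending. It is not the RegionManager code: the class name, the method names, the formula `load > avg * (1 + slop)`, the rounding used for the reassign count, and the gate condition are all assumptions, chosen only because they reproduce the logged values (load 26, avg 21.0, slop 0.2, reassign 5). The 1000/50 region counts in `main` are made up to stand in for "about 5% opened".

```java
// Hypothetical sketch; names and formulas are illustrative, not HBase's API.
final class BalancerGateSketch {

    /** Assumed rule: a server is overloaded when its load exceeds avg * (1 + slop). */
    static boolean isOverloaded(int serverLoad, double avgLoad, double slop) {
        return serverLoad > avgLoad * (1.0 + slop);
    }

    /** Assumed rule: shed enough regions to return to the rounded-up average. */
    static int regionsToShed(int serverLoad, double avgLoad) {
        return Math.max(0, serverLoad - (int) Math.ceil(avgLoad));
    }

    /** Assumed gate: balance only once every assigned region has reported open. */
    static boolean safeToBalance(int regionsAssigned, int regionsOpened) {
        return regionsAssigned > 0 && regionsOpened >= regionsAssigned;
    }

    public static void main(String[] args) {
        // Values taken from the log line: "Server load: 26 avg: 21.0, slop: 0.2".
        int load = 26;
        double avg = 21.0;
        double slop = 0.2;
        System.out.println("overloaded: " + isOverloaded(load, avg, slop));
        System.out.println("regions to shed: " + regionsToShed(load, avg));

        // Made-up counts standing in for "about 5% of assigned regions opened".
        System.out.println("safe to balance: " + safeToBalance(1000, 50));
    }
}
```

With a gate of this kind in front of the overload check, the server in the log would still be over the slop threshold, but no region would be closed until the outstanding opens had been reported.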