From: "Jim Kellerman (JIRA)"
To: hbase-dev@hadoop.apache.org
Date: Wed, 18 Jun 2008 13:27:45 -0700 (PDT)
Subject: [jira] Commented: (HBASE-615) Region balancer oscillates during cluster startup
Message-ID: <1318276826.1213820865131.JavaMail.jira@brutus>
In-Reply-To: <624420527.1210037635886.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HBASE-615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606077#action_12606077 ]

Jim Kellerman commented on HBASE-615:
-------------------------------------

> Rong-En Fan - 11/Jun/08 09:43 AM
> Please correct me if I am wrong or have overlooked something.
>
> During startup, META will be requested more than other regions. Therefore,
> the RegionServer that serves META will be considered more "loaded" than
> others. So, we tend not to assign more regions to that one. However,
> our rebalance algo currently considers only # of loaded regions as the "load"
> for region servers. That's the cause of oscillation at startup.
>
> I'm thinking of the possibility that during startup, we just assign
> evenly to all region servers. Once this is stabilized, we start to consider
> # of requests as part of the server load. Moreover, the "# of requests" here
> should be calculated over a period of time; otherwise, we may move regions
> just because of some spikes.

You are absolutely correct. During startup, the server hosting the meta region gets all the requests, so its one region is multiplied by the number of requests, giving a "load" far greater than that of the other servers, which are getting no requests and whose "load" is consequently just the number of regions they are serving.

It should be a fairly easy fix to ignore requests during startup.

BTW, I verified this by changing HServerLoad.getLoad() to just return the number of regions. The cluster had all regions on-line within a couple of minutes and they were balanced. When the number of requests was factored in during startup, the cluster did not achieve a steady state for half an hour (after which I gave up).

Just my 2 cents.
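To illustrate the effect described above, here is a minimal sketch of a load calculation that can ignore requests. It is not the actual HServerLoad code: the class name, the method signature, the countRequests flag, and the numbers in main are all assumptions made for illustration. It only mirrors the behavior described in this comment, where load is regions multiplied by requests and a server with no requests has a load equal to its region count.

```java
// Hypothetical sketch; not the real HServerLoad implementation.
final class StartupLoadSketch {

    /**
     * Returns regions * requests when requests are counted and non-zero,
     * otherwise just the number of regions (the behavior described above).
     */
    static int load(int regions, int requests, boolean countRequests) {
        if (!countRequests || requests <= 0) {
            return regions;
        }
        return regions * requests;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: the meta server holds 1 region
        // and has received 500 requests during startup.
        System.out.println(load(1, 500, true));   // prints 500
        // A server with 13 regions and no requests.
        System.out.println(load(13, 0, true));    // prints 13
        // Same meta server with requests ignored during startup.
        System.out.println(load(1, 500, false));  // prints 1
    }
}
```

With requests counted, the server holding a single region looks far more loaded than one holding 13; with requests ignored, the comparison falls back to region counts alone.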
> Region balancer oscillates during cluster startup
> -------------------------------------------------
>
>                 Key: HBASE-615
>                 URL: https://issues.apache.org/jira/browse/HBASE-615
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.2.0
>            Reporter: Jim Kellerman
>            Assignee: Bryan Duxbury
>            Priority: Blocker
>             Fix For: 0.2.0
>
>         Attachments: 615-lite.patch
>
>
> When starting a cluster with four region servers and a large table (49 regions, plus root and meta, for 51 total regions), the region balancer oscillates for a very long time and does not seem to reach a steady state.
> Additionally, for whatever reason, it seems reluctant to assign regions to the first of four region servers, which may be the root cause. In my test, the first server had 10 regions assigned, the second and fourth had 13 regions assigned, and the master would continually assign and deassign 2 regions to the third server, which oscillated between 13 and 15 regions. If it assigned the two fluctuating regions to the first server, it would achieve the best balance possible: 12, 13, 13, 13.
> After 20 minutes, it had not stopped oscillating. An application trying to work against this cluster would run very slowly, as it would be continually re-finding the two regions in flux.
> When the table was being created, regions were nicely balanced. On restart, however, it just would not settle down.
> Perhaps the balancer should set a target number of regions for each server; once a server is within +/- 1 region of that target, the rebalancer would not try to change its assignment unless the number of regions changed.
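The target-based idea at the end of the issue description could look roughly like the sketch below. It is only one reading of the proposal, not the balancer's code: the class and method names are hypothetical, and taking the target as the integer average of regions per server is an assumption.

```java
// Hypothetical sketch of the "target +/- 1" proposal; not HBase code.
final class RegionTargetSketch {

    /** Target regions per server: total regions divided by server count (assumed). */
    static int target(int[] regionsPerServer) {
        int total = 0;
        for (int count : regionsPerServer) {
            total += count;
        }
        return total / regionsPerServer.length;
    }

    /** True when every server is within +/- 1 region of the target. */
    static boolean settled(int[] regionsPerServer) {
        int target = target(regionsPerServer);
        for (int count : regionsPerServer) {
            if (Math.abs(count - target) > 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Distribution reported in the issue (third server at its high point).
        System.out.println(settled(new int[] {10, 13, 13, 15})); // prints false
        // Best balance named in the issue for 51 regions on 4 servers.
        System.out.println(settled(new int[] {12, 13, 13, 13})); // prints true
    }
}
```

Under this reading, the balancer would leave the 12, 13, 13, 13 layout alone and would keep moving regions only while some server is more than one region away from the target of 12.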