Date: Sat, 25 Jun 2011 01:44:47 +0000 (UTC)
From: "Jieshan Bean (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-4031) An imbalance result calculated by LoadBalancer

[ https://issues.apache.org/jira/browse/HBASE-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4031:
--------------------------------

    Attachment: HMaster222.rar

> An imbalance result calculated by LoadBalancer
> ----------------------------------------------
>
>                 Key: HBASE-4031
>                 URL: https://issues.apache.org/jira/browse/HBASE-4031
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HMaster222.rar
>
>
> I found the problem when the cluster could not balance (around 2011-05-24 11:28). One node held roughly twice as many regions as each of the other nodes, and the balancer no longer moved any regions:
> Address               Start Code      Load
> 158-1-101-202:20030   1306205409671   requests=0, regions=2593, usedHeap=114, maxHeap=8165
> 158-1-101-222:20030   1306205940117   requests=0, regions=5841, usedHeap=80, maxHeap=8165
> 158-1-101-52:20030    1306205417261   requests=0, regions=2622, usedHeap=76, maxHeap=8165
> 158-1-101-82:20030    1306205415714   requests=0, regions=2633, usedHeap=69, maxHeap=8165
> Total: servers: 4     requests=0, regions=13689
> While analyzing this problem I found HBASE-3985, "Same Region could be picked out twice in LoadBalancer".
> But I'm afraid it is not the main cause of the problem.
> There is one active master, one standby master, and four regionservers in our cluster.
> >>10:57:41, the standby master 222 becomes the active one.
> 2011-05-24 10:57:41,314 INFO org.apache.hadoop.hbase.master.HMaster: Master startup proceeding: master failover
> >>The 4 regionservers were registered in 222 one by one. Only one regionserver seemed to register somewhat late.
> 2011-05-24 10:57:37,533 INFO : Registering server=158-1-101-82,20020,1306205415714, regionCount=3388, userLoad=true
> 2011-05-24 10:57:37,537 INFO : Registering server=158-1-101-202,20020,1306205409671, regionCount=3453, userLoad=true
> 2011-05-24 10:57:37,598 INFO : Registering server=158-1-101-52,20020,1306205417261, regionCount=3411, userLoad=true
> 2011-05-24 10:59:00,408 INFO : Registering server=158-1-101-222,20020,1306205940117, regionCount=0, userLoad=false
> >>13134 regions needed to move after rebuildUserRegions (13689 regions in the cluster at the time).
> 2011-05-24 10:58:47,534 INFO org.apache.hadoop.hbase.master.AssignmentManager: Failed-over master needs to process 13134 regions in transition
> >>All 13134 regions were opened. Count of regions opened on each server:
> 158-1-101-222,20020,1306205940117 Count: 834
> 158-1-101-82,20020,1306205415714 Count: 4093
> 158-1-101-202,20020,1306205409671 Count: 4118
> 158-1-101-52,20020,1306205417261 Count: 4089
> >>The nearest balancer calculation result:
> 2011-05-24 11:12:11,076 INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in 19ms. Moving 5012 regions off of 3 overloaded servers onto 1 less loaded servers
> "5012" is an implausible number here, because it is larger than the average number "3424.5".
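
The reporter's closing comparison (a move count larger than the per-server average) can be checked with a rough calculation. The sketch below is only an illustration. It assumes the opened-region counts listed in the report are the state the balancer saw, which the report does not confirm, and it uses a simplified "no server above the rounded-up average" rule, not the actual HBase LoadBalancer code. The function name `expected_moves` is hypothetical.

```python
import math

# Per-server counts of opened regions, copied from the report.
# Assumption: this is the load the balancer saw at 11:12:11.
opened = {
    "158-1-101-222": 834,
    "158-1-101-82": 4093,
    "158-1-101-202": 4118,
    "158-1-101-52": 4089,
}

REPORTED_MOVES = 5012  # from the LoadBalancer log line in the report


def expected_moves(counts):
    """Return (average, moves) under a simplified even-spread rule.

    moves is the number of regions that would have to leave overloaded
    servers so that no server holds more than ceil(average) regions.
    This is an illustrative rule, not HBase's implementation.
    """
    total = sum(counts.values())
    average = total / len(counts)
    ceiling = math.ceil(average)
    moves = sum(max(0, n - ceiling) for n in counts.values())
    return average, moves


average, moves = expected_moves(opened)
print(f"average={average}, expected moves={moves}, reported={REPORTED_MOVES}")
```

Under these assumptions the three overloaded servers would need to shed 2448 regions in total, less than half of the 5012 moves the balancer logged, which is consistent with the reporter's suspicion that the plan itself was wrong.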