Subject: Re: One problem with LoadBalancer
From: Jean-Daniel Cryans
To: user@hbase.apache.org
Date: Fri, 24 Jun 2011 11:46:24 -0700

I feel like I'm missing too much information to be helpful. For example, when the standby master comes up it needs to process 13134 RIT. What happened there? I thought the regions were all assigned? What happened to the first master? How come 1306205940117 went from 5841 regions to 0?

Thx for filling the gaps,

J-D

On Thu, Jun 23, 2011 at 6:35 PM, bijieshan wrote:
> Hi,
>
> I found the problem while the cluster couldn't balance. One node's region count is double that of the other nodes.
> And it didn't move regions anymore:
>
> Address              Start Code     Load
> 158-1-101-202:20030  1306205409671  requests=0, regions=2593, usedHeap=114, maxHeap=8165
> 158-1-101-222:20030  1306205940117  requests=0, regions=5841, usedHeap=80, maxHeap=8165
> 158-1-101-52:20030   1306205417261  requests=0, regions=2622, usedHeap=76, maxHeap=8165
> 158-1-101-82:20030   1306205415714  requests=0, regions=2633, usedHeap=69, maxHeap=8165
> Total: servers: 4  requests=0, regions=13689
>
>
> HBASE-3985 ("Same Region could be picked out twice in LoadBalancer") was found by my analysis of this problem.
> But I'm afraid it's not the main cause of the problem.
>
> There's one active master, one standby master, and four regionservers in our cluster.
>
>>>10:57:41, the standby HMaster 222 becomes the active one.
> 2011-05-24 10:57:41,314 INFO org.apache.hadoop.hbase.master.HMaster: Master startup proceeding: master failover
>
>>>The 4 regionservers were registered with 222 one by one. Only one regionserver registered somewhat late.
> 2011-05-24 10:57:37,533 INFO : Registering server=158-1-101-82,20020,1306205415714, regionCount=3388, userLoad=true
> 2011-05-24 10:57:37,537 INFO : Registering server=158-1-101-202,20020,1306205409671, regionCount=3453, userLoad=true
> 2011-05-24 10:57:37,598 INFO : Registering server=158-1-101-52,20020,1306205417261, regionCount=3411, userLoad=true
> 2011-05-24 10:59:00,408 INFO : Registering server=158-1-101-222,20020,1306205940117, regionCount=0, userLoad=false
>
>>>13134 regions needed to move after rebuildUserRegions (13689 regions in the cluster at the time).
> 2011-05-24 10:58:47,534 INFO org.apache.hadoop.hbase.master.AssignmentManager: Failed-over master needs to process 13134 regions in transition
>
>>>All the 13134 regions were opened. Regions opened on each server:
> 158-1-101-222,20020,1306205940117    Count: 834
> 158-1-101-82,20020,1306205415714     Count: 4093
> 158-1-101-202,20020,1306205409671    Count: 4118
> 158-1-101-52,20020,1306205417261     Count: 4089
>
>>>The most recent balancer calculation result:
> 2011-05-24 11:12:11,076 INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in 19ms. Moving 5012 regions off of 3 overloaded servers onto 1 less loaded servers
>
> "5012" is an implausible number here, because it is larger than the average number "3424.5".
>
>
> Jieshan Bean
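
For reference, here is a rough sanity check of the "5012" figure, using only the per-server opened counts quoted above. It is a simplified model and not the HBase LoadBalancer code: it assumes a perfectly even target (every server at the floor or ceiling of the average) and counts the fewest region moves needed to get there. The function name `min_moves_to_balance` is made up for this sketch, and the state the balancer actually saw at 11:12 may have differed from these counts.

```python
# Regions opened per server after the failover, copied from the quoted log.
opened = {
    "158-1-101-222": 834,
    "158-1-101-82": 4093,
    "158-1-101-202": 4118,
    "158-1-101-52": 4089,
}

def min_moves_to_balance(counts):
    """Fewest region moves that leave every server at floor or ceil of the average.

    Simplified model (an assumption for illustration), not HBase's algorithm.
    """
    total = sum(counts.values())
    servers = len(counts)
    floor, remainder = divmod(total, servers)
    # The `remainder` most loaded servers keep one extra region (target = floor + 1).
    loads = sorted(counts.values(), reverse=True)
    moves = 0
    for index, load in enumerate(loads):
        target = floor + 1 if index < remainder else floor
        if load > target:
            moves += load - target  # regions this server has to give away
    return moves

total = sum(opened.values())
average = total / len(opened)
moves = min_moves_to_balance(opened)
print("total regions:", total)
print("average per server:", average)
print("minimum moves in this model:", moves)
```

With these counts the total is 13134, the average is 3283.5, and the model needs 2449 moves, roughly half of the 5012 the balancer reported. That gap fits the suspicion that something is being counted more than once, though this check alone does not show where.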