Date: Tue, 10 Aug 2010 21:38:53 -0700
Subject: Re: load balancing considerations
From: Stack
To: dev@hbase.apache.org

Ted:

You have 22 column families in your schema? Do you need that many? Run with fewer if you can, because 22 CFs takes you into a category that not many hang out in. It may be at the root of the OOME.

Otherwise, it's the usual suspects -- a bad record perhaps? One that was incorrectly formatted so it had a very large size on it?

Do you run w/ GC logging enabled? If not, try it. Apparently it's near to frictionless. It might give us more clues.

Also, when the RS crashes, it'll dump heap by default. Do you see it? If you put it someplace that I can pull, I'll take a look at it.
St.Ack

On Tue, Aug 10, 2010 at 9:30 PM, Ted Yu wrote:
> We use 0.20.6 with HBASE-2473.
> As you can see from the following region server log snippet, an OOME
> happened on this RS:
>
> 2010-08-11 03:59:12,760 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Blocking updates for 'IPC Server handler 17 on 60020' on region
> 2__HB_NOINC_GRID_0809-THREEGPPSPEECHCALLS-1281499094297,\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E,1281499095128:
> memstore size 1.0g is >= than blocking 1.0g size
> 2010-08-11 03:59:16,853 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Blocking updates for 'IPC Server handler 24 on 60020' on region
> 2__HB_NOINC_GRID_0809-THREEGPPSPEECHCALLS-1281499094297,\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E,1281499095128:
> memstore size 1.0g is >= than blocking 1.0g size
> 2010-08-11 03:59:44,524 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: OutOfMemoryError,
> aborting.
> java.lang.OutOfMemoryError: Java heap space
>         at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
>         at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:825)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:419)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.run(HBaseServer.java:318)
> 2010-08-11 03:59:44,525 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> request=0.0, regions=9, stores=22, storefiles=4, storefileIndexSize=5,
> memstoreSize=1502, compactionQueueSize=0, usedHeap=*3929*, maxHeap=3973,
> blockCacheSize=6836104, blockCacheFree=826362424, blockCacheCount=0,
> blockCacheHitRatio=0, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
>
> Among the other RS, the highest usedHeap is 1750.
>
> On Sat, Jul 31, 2010 at 3:31 PM, Ryan Rawson wrote:
>
>> Hi,
>>
>> #3 is going to be tricky... due to the ebb and flow of the GC, this value
>> isn't as accurate as one would wish. Furthermore, we flush memstores based
>> on RAM pressure.
>>
>> Any algorithm would have to have the property of being stable and
>> conservative... rebalancing is not a 0 impact operation.
>>
>> There are JIRAs open for rebalancing based on load. To date it hasn't
>> been a practical problem here at SU in our prod clusters, however.
>>
>> On Jul 31, 2010 3:18 PM, "Ted Yu" wrote:
>> > Hi,
>> > Currently load balancing only considers region count.
>> > See ServerManager.getAverageLoad()
>> >
>> > I think load balancing should consider the following three factors for
>> > each RS:
>> > 1. number of regions it hosts
>> > 2. number of requests it serves within given period
>> > 3. how close usedHeap is to maxHeap
>> >
>> > Please comment how we should weigh the above three factors in deciding
>> > the regions to offload from each RS.
>> >
>> > Thanks
>> >
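
For illustration only, here is a minimal sketch of one way the three factors Ted lists could be folded into a single per-server score, with a slop margin so that rebalancing stays conservative, as Ryan asks. This is not how HBase's balancer works: the names ServerLoad, load_score and servers_to_offload, the weights, and the slop value are all assumptions made up for this example. The heap term gets the smallest weight because, as noted in the thread, usedHeap moves with the GC. The rs1 figures come from the metrics dump above; the region and request counts for rs2 and everything about rs3 are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerLoad:
    """Per-region-server metrics (hypothetical container for this sketch)."""
    name: str
    regions: int        # factor 1: number of regions hosted
    requests: float     # factor 2: requests served in the sampling period
    used_heap_mb: int   # factor 3: usedHeap ...
    max_heap_mb: int    # ... relative to maxHeap


def load_score(s: ServerLoad, avg_regions: float, avg_requests: float,
               w_regions: float = 0.5, w_requests: float = 0.3,
               w_heap: float = 0.2) -> float:
    """Weighted sum of the three factors; the weights are assumed, not tuned."""
    # Regions and requests are normalised against the cluster average.
    region_term = s.regions / avg_regions if avg_regions else 0.0
    request_term = s.requests / avg_requests if avg_requests else 0.0
    # Heap pressure is already a ratio between 0 and 1.
    heap_term = s.used_heap_mb / s.max_heap_mb
    return w_regions * region_term + w_requests * request_term + w_heap * heap_term


def servers_to_offload(servers: List[ServerLoad], slop: float = 0.2) -> List[str]:
    """Names of servers whose score exceeds the mean score by more than `slop`."""
    n = len(servers)
    avg_regions = sum(s.regions for s in servers) / n
    avg_requests = sum(s.requests for s in servers) / n
    scores = {s.name: load_score(s, avg_regions, avg_requests) for s in servers}
    mean = sum(scores.values()) / n
    return sorted(name for name, sc in scores.items() if sc > mean * (1 + slop))


if __name__ == "__main__":
    cluster = [
        ServerLoad("rs1", 9, 0.0, 3929, 3973),  # from the metrics dump above
        ServerLoad("rs2", 9, 0.0, 1750, 3973),  # usedHeap from the thread, rest assumed
        ServerLoad("rs3", 9, 0.0, 1500, 3973),  # entirely hypothetical
    ]
    print(servers_to_offload(cluster, slop=0.2))  # -> []
    print(servers_to_offload(cluster, slop=0.1))  # -> ['rs1']
```

With equal region counts and no requests, only heap pressure separates the servers, and at a weight of 0.2 it takes a slop as tight as 10% before rs1 is flagged. A larger slop trades responsiveness for stability, which is the trade-off the thread is debating.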