Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hbase.apache.org
MIME-Version: 1.0
Sender: saint.ack@gmail.com
In-Reply-To: <AANLkTimFpNNibNVA+cCxB4sYysSW6uay4w1PUK+ODy0Y@mail.gmail.com>
References: <AANLkTi=e-bxngi+RXPa1j4QhpUp+BhkYrqP6CV+HC0WA@mail.gmail.com>
	<AANLkTinoqGpxWs5xuKnTxnxm1a=K90zJNd2Y7cKDyoVP@mail.gmail.com>
	<AANLkTimFpNNibNVA+cCxB4sYysSW6uay4w1PUK+ODy0Y@mail.gmail.com>
Date: Tue, 10 Aug 2010 21:38:53 -0700
Message-ID: <AANLkTin7oVDQZCfj5KYH58gxPvRsBTZtkZAvYqPDgxub@mail.gmail.com>
Subject: Re: load balancing considerations
From: Stack <stack@duboce.net>
To: dev@hbase.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Ted:

You have 22 column families in your schema?  Do you need that many?
Run with fewer if you can, because 22 CFs puts you in a category that
not many hang out in.  It may be at the root of the OOME.

Otherwise, it's the usual suspects -- a bad record perhaps?  One that
was incorrectly formatted so it had a very large size on it?

Do you run w/ GC logging enabled?  If not, try it.  Apparently it's near
to frictionless.  It might give us more clues.

Also, when the RS crashes, it'll dump heap by default.  Do you see it?
If you put it someplace that I can pull, I'll take a look at it.

St.Ack

On Tue, Aug 10, 2010 at 9:30 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> We use 0.20.6 with HBASE-2473
> As you can see from the following region server log snippet, OOME happened
> to this RS:
>
> 2010-08-11 03:59:12,760 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Blocking updates for 'IPC Server handler 17 on 60020' on region
> 2__HB_NOINC_GRID_0809-THREEGPPSPEECHCALLS-1281499094297,\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E,1281499095128:
> memstore size 1.0g is >= than blocking 1.0g size
> 2010-08-11 03:59:16,853 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Blocking updates for 'IPC Server handler 24 on 60020' on region
> 2__HB_NOINC_GRID_0809-THREEGPPSPEECHCALLS-1281499094297,\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E,1281499095128:
> memstore size 1.0g is >= than blocking 1.0g size
> 2010-08-11 03:59:44,524 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: OutOfMemoryError,
> aborting.
> java.lang.OutOfMemoryError: Java heap space
>        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
>        at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:825)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:419)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.run(HBaseServer.java:318)
> 2010-08-11 03:59:44,525 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> request=0.0, regions=9, stores=22, storefiles=4, storefileIndexSize=5,
> memstoreSize=1502, compactionQueueSize=0, usedHeap=*3929*, maxHeap=3973,
> blockCacheSize=6836104, blockCacheFree=826362424, blockCacheCount=0,
> blockCacheHitRatio=0, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
>
> Among the other RS, the highest usedHeap is 1750
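
To put a number on the metrics dump above, here is a minimal Java sketch
that pulls usedHeap and maxHeap out of a dump line and computes the used
fraction of the heap.  The class and method names are made up for
illustration (none of this is HBase API), and it assumes both values are
reported in the same unit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HeapHeadroom {
    /** Parses "key=value, key=value" pairs from a metrics dump line. */
    static Map<String, String> parseMetrics(String dump) {
        Map<String, String> metrics = new LinkedHashMap<>();
        for (String pair : dump.split(",")) {
            String[] kv = pair.trim().split("=", 2);
            if (kv.length == 2) {
                // Drop '*' characters: in the mail they are the poster's
                // emphasis around a value, not part of the metric.
                metrics.put(kv[0].trim(), kv[1].replace("*", "").trim());
            }
        }
        return metrics;
    }

    /** Fraction of the maximum heap currently in use (0.0 to 1.0). */
    static double heapUtilization(Map<String, String> metrics) {
        double used = Double.parseDouble(metrics.get("usedHeap"));
        double max = Double.parseDouble(metrics.get("maxHeap"));
        return used / max;
    }

    public static void main(String[] args) {
        // Values copied from the dump in this thread.
        String dump = "regions=9, stores=22, memstoreSize=1502, "
                + "usedHeap=*3929*, maxHeap=3973";
        Map<String, String> m = parseMetrics(dump);
        System.out.println("heap utilization: " + heapUtilization(m));
    }
}
```

With the figures from the dump the ratio is 3929 / 3973, i.e. the heap was
roughly 99% used when the RS aborted.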
>
> On Sat, Jul 31, 2010 at 3:31 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>
>> Hi,
>>
>> #3 is going to be tricky... due to the ebb and flow of the GC this value
>> isn't as accurate as one would wish. Furthermore we flush memstores based
>> on RAM pressure.
>>
>> Any algorithm would have to have the property of being stable and
>> conservative... rebalancing is not a 0 impact operation.
>>
>> There are jiras open for the rebalance based on load. To date it hasn't been
>> a practical problem here at SU in our prod clusters however.
>>
>> On Jul 31, 2010 3:18 PM, "Ted Yu" <yuzhihong@gmail.com> wrote:
>> > Hi,
>> > Currently load balancing only considers region count.
>> > See ServerManager.getAverageLoad()
>> >
>> > I think load balancing should consider the following three factors for
>> > each RS:
>> > 1. number of regions it hosts
>> > 2. number of requests it serves within given period
>> > 3. how close usedHeap is to maxHeap
>> >
>> > Please comment how we should weigh the above three factors in deciding
>> > the regions to offload from each RS.
>> >
>> > Thanks
>>
>
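
For discussion, the three factors from the original post (region count,
request count, usedHeap vs. maxHeap) could be folded into one weighted
score per RS, roughly as sketched below.  This is only an illustration
under stated assumptions: ServerLoad, score() and rank() are hypothetical
names, not the balancer in ServerManager, and the weights are placeholders
nobody in this thread proposed.  Given the point about the heap value
moving with the GC, the heap weight would presumably be kept small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class LoadScoreSketch {
    /** Per-server inputs. Hypothetical holder, not an HBase type. */
    static final class ServerLoad {
        final String name;
        final int regions;
        final double requests;   // requests served in the sampling period
        final double usedHeap;
        final double maxHeap;

        ServerLoad(String name, int regions, double requests,
                   double usedHeap, double maxHeap) {
            this.name = name;
            this.regions = regions;
            this.requests = requests;
            this.usedHeap = usedHeap;
            this.maxHeap = maxHeap;
        }
    }

    /** Weighted sum of the three factors, each normalized to be unit-free. */
    static double score(ServerLoad s, double avgRegions, double avgRequests,
                        double wRegions, double wRequests, double wHeap) {
        double regionTerm = avgRegions > 0 ? s.regions / avgRegions : 0.0;
        double requestTerm = avgRequests > 0 ? s.requests / avgRequests : 0.0;
        double heapTerm = s.maxHeap > 0 ? s.usedHeap / s.maxHeap : 0.0;
        return wRegions * regionTerm + wRequests * requestTerm + wHeap * heapTerm;
    }

    /** Returns a copy of the servers ordered from most to least loaded. */
    static List<ServerLoad> rank(List<ServerLoad> servers, double wRegions,
                                 double wRequests, double wHeap) {
        double avgRegions =
                servers.stream().mapToInt(s -> s.regions).average().orElse(0.0);
        double avgRequests =
                servers.stream().mapToDouble(s -> s.requests).average().orElse(0.0);
        List<ServerLoad> ranked = new ArrayList<>(servers);
        ranked.sort(Comparator.comparingDouble((ServerLoad s) ->
                score(s, avgRegions, avgRequests, wRegions, wRequests, wHeap))
                .reversed());
        return ranked;
    }

    public static void main(String[] args) {
        // Heap figures for rs1 come from the thread; the request counts,
        // rs2's region count and rs2's maxHeap are invented for the example.
        List<ServerLoad> servers = Arrays.asList(
                new ServerLoad("rs1", 9, 0.0, 3929, 3973),
                new ServerLoad("rs2", 9, 120.0, 1750, 3973));
        for (ServerLoad s : rank(servers, 1.0, 0.5, 0.25)) {
            System.out.println(s.name);
        }
    }
}
```

A balancer built on a score like this would still need to be stable and
conservative, e.g. only moving regions when the gap between the highest
and lowest score stays above some threshold for a while.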