hbase-user mailing list archives

From Seraph Imalia <ser...@eisp.co.za>
Subject Re: Hbase pausing problems
Date Wed, 20 Jan 2010 09:06:12 GMT
Hi Jean-Daniel,

I have uploaded all the hbase logs for 2010-01-17 for all three
regionservers and the master to http://rapidshare.com/users/6Q0621

The client stops being able to write to hBase as soon as one of the
regionservers starts doing this...

2010-01-17 01:16:25,729 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Forced flushing of
ChannelDelivery,5352f559-d68e-42e9-be92-8bae82185ed1,1262544772804 because
global memstore limit of 396.7m exceeded; currently 396.7m and flushing till
247.9m

Or this...

2010-01-17 01:16:26,159 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Forced flushing of
AdDelivery,613a401d-fb8a-42a9-aac6-d957f6281035,1261867806692 because global
memstore limit of 396.7m exceeded; currently 390.4m and flushing till 247.9m

And then as soon as it finishes that, it starts doing this...

2010-01-17 01:16:36,709 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
requested for region
AdDelivery,fb98f6c9-db13-4853-92ee-ffe1182fffd0,1263544763046/350999600
because: regionserver/192.168.2.88:60020.cacheFlusher

And as soon as it has finished the last of the Compaction Requests, the
client recovers and the regionserver starts doing this...

2010-01-17 01:16:36,713 DEBUG org.apache.hadoop.hbase.regionserver.Store:
Compaction size of ChannelDelivery_Family: 209.5m; Skipped 1 file(s), size:
216906650
2010-01-17 01:16:36,713 DEBUG org.apache.hadoop.hbase.regionserver.Store:
Started compaction of 3 file(s)  into
/hbase/ChannelDelivery/compaction.dir/165262792, seqid=1241653592
2010-01-17 01:16:37,143 DEBUG org.apache.hadoop.hbase.regionserver.Store:
Completed compaction of ChannelDelivery_Family; new storefile is
hdfs://dynobuntu6:8020/hbase/ChannelDelivery/165262792/ChannelDelivery_Famil
y/1673693545539520912; store size is 209.5m

All of these logs seem perfectly acceptable to me - the problem is that it
just requires one of the regionservers to start doing this for the client to
be prevented from inserting new rows into hBase.  The logs don't seem to
explain why this is happening.
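For what it's worth, those numbers look like the global memstore thresholds
from hbase-default.xml on a roughly 1 GB heap (396.7m is about 0.4 of heap and
247.9m about 0.25), so one thing I am tempted to try is tuning them in
hbase-site.xml. The fragment below just restates what I believe are the
shipped defaults so we know what we would be changing - the property names are
as they appear in our hbase-default.xml, so please correct me if they differ
in your version:

```xml
<!-- hbase-site.xml: the global memstore thresholds, shown at what I
     understand to be the shipped defaults (fractions of the heap).
     Writes block when usage passes the upper limit and flushing runs
     until usage drops below the lower limit. -->
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>
</property>
<property>
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.25</value>
</property>
```

Raising the upper limit without adding heap presumably just delays the forced
flush - which matches what we saw when we added RAM - so I would only touch
these together with the heap and GC settings.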

Thank you for your assistance thus far; please let me know if you need
anything else or if you discover anything.

Regards,
Seraph
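
P.S. For tonight's GC-logging run, this is the line I intend to add to
hbase-env.sh on each regionserver - the JVM flags are the standard HotSpot
GC-logging ones from the stock template; the log path is just my choice:

```shell
# hbase-env.sh: turn on verbose GC logging for the region server JVMs
# so the GC pauses can be lined up against the client-side stalls.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-regionserver.log"
```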



> From: Jean-Daniel Cryans <jdcryans@apache.org>
> Reply-To: <hbase-user@hadoop.apache.org>
> Date: Mon, 18 Jan 2010 09:49:16 -0800
> To: <hbase-user@hadoop.apache.org>
> Subject: Re: Hbase pausing problems
> 
> The next step would be to take a look at your region server's log
> around the time of the insert and clients who don't resume after the
> loss of a region server. If you are able to gzip them and put them on
> a public server, it would be awesome.
> 
> Thx,
> 
> J-D
> 
> On Mon, Jan 18, 2010 at 1:03 AM, Seraph Imalia <seraph@eisp.co.za> wrote:
>> Answers below...
>> 
>> Regards,
>> Seraph
>> 
>>> From: stack <stack@duboce.net>
>>> Reply-To: <hbase-user@hadoop.apache.org>
>>> Date: Fri, 15 Jan 2010 10:10:39 -0800
>>> To: <hbase-user@hadoop.apache.org>
>>> Subject: Re: Hbase pausing problems
>>> 
>>> How many CPUs?
>> 
>> 1x Quad Xeon in each server
>> 
>>> 
>>> You are using default JVM settings (see HBASE_OPTS in hbase-env.sh).  You
>>> might want to enable GC logging.  See the commented-out line in
>>> hbase-env.sh.  Enable it.  GC logging might tell you about the pauses you
>>> are seeing.
>> 
>> I will enable GC Logging tonight during our slow time because restarting the
>> regionservers causes the clients to pause indefinitely.
>> 
>>> 
>>> Can you get a fourth server for your cluster and run the master, zk, and
>>> namenode on it and leave the other three servers for regionserver and
>>> datanode (with perhaps replication == 2 as per J-D to lighten load on small
>>> cluster).
>> 
>> We plan to double the number of servers in the next few weeks and I will
>> take your advice to put the master, zk and namenode on it (we will need to
>> have a second one on standby should this one crash).  The servers will be
>> ordered shortly and will be here in a week or two.
>> 
>> That said, I have been monitoring CPU usage and none of them seem
>> particularly busy.  The regionserver on each one hovers around 30% all the
>> time and the datanode sits at about 10% most of the time.  If we do have a
>> resource issue, it definitely does not seem to be CPU.
>> 
>> Increasing RAM did not seem to work either - it just made hBase use a bigger
>> memstore and then it took longer to do a flush.
>> 
>> 
>>> 
>>> More notes inline in below.
>>> 
>>> On Fri, Jan 15, 2010 at 1:33 AM, Seraph Imalia <seraph@eisp.co.za> wrote:
>>> 
>>>> Approximately every 10 minutes, our entire coldfusion system pauses at the
>>>> point of inserting into hBase for between 30 and 60 seconds and then
>>>> continues.
>>>> 
>>> Yeah, enable GC logging.  See if you can make a correlation between the pause
>>> the client is seeing and a GC pause.
>>> 
>>> 
>>> 
>>> 
>>>> Investigation...
>>>> 
>>>> Watching the logs of the regionserver, the pausing of the coldfusion system
>>>> happens as soon as one of the regionservers starts flushing the memstore
>>>> and
>>>> recovers again as soon as it is finished flushing (recovers as soon as it
>>>> starts compacting).
>>>> 
>>> 
>>> 
>>> ...though, this would seem to point to an issue with your hardware.  How
>>> many disks?  Are they misconfigured such that they hold up the system when
>>> they are being heavily written to?
>>> 
>>> 
>>> A regionserver log at DEBUG from around this time so we could look at it
>>> would be helpful.
>>> 
>>> 
>>>> I can recreate the error just by stopping 1 of the regionservers; but then
>>>> starting the regionserver again does not make coldfusion recover until I
>>>> restart the coldfusion servers.  It is important to note that if I keep the
>>>> built in hBase shell running, it is happily able to put and get data to and
>>>> from hBase whilst coldfusion is busy pausing/failing.
>>>> 
>>> 
>>> This seems odd.  Enable DEBUG for the client-side.  Do you see the shell
>>> recalibrating, finding new locations for regions after you shut down the
>>> single regionserver, something that your coldfusion is not doing?  Or,
>>> maybe, the shell is putting to a regionserver that has not been disturbed
>>> by your start/stop?
>>> 
>>> 
>>>> 
>>>> I have tried increasing the regionserver's RAM to 3 Gigs and this just made
>>>> the problem worse because it took longer for the regionservers to flush the
>>>> memory store.
>>> 
>>> 
>>> Again, if flushing is holding up the machine, if you can't write a file in
>>> background without it freezing your machine, then your machines are anemic
>>> or misconfigured?
>>> 
>>> 
>>>> One of the links I found on your site mentioned increasing
>>>> the default value for hbase.regionserver.handler.count to 100 - this did
>>>> not
>>>> seem to make any difference.
>>> 
>>> 
>>> Leave this configuration in place I'd say.
>>> 
>>> Are you seeing 'blocking' messages in the regionserver logs?  A regionserver
>>> will stop taking on writes if it thinks it's being overrun, to prevent itself
>>> from OOME'ing.  Grep for the 'multiplier' configuration in hbase-default.xml.
>>> 
>>> 
>>> 
>>>> I have double checked that the memory flush
>>>> very rarely happens on more than 1 regionserver at a time - in fact in my
>>>> many hours of staring at tails of logs, it only happened once where two
>>>> regionservers flushed at the same time.
>>>> 
>>> You've enabled DEBUG?
>>> 
>>> 
>>> 
>>>> My investigations point strongly towards a coding problem on our side
>>>> rather
>>>> than a problem with the server setup or hBase itself.
>>> 
>>> 
>>> If things were slow from client-perspective, that might be a client-side
>>> coding problem but these pauses, unless you have a fly-by deadlock in your
>>> client-code, it's probably an hbase issue.
>>> 
>>> 
>>> 
>>>>  I say this because
>>>> whilst I understand why a regionserver would go offline during a memory
>>>> flush, I would expect the other two regionservers to pick up the load -
>>>> especially since the built-in hbase shell has no problem accessing hBase
>>>> whilst a regionserver is busy doing a memstore flush.
>>>> 
>>> HBase does not go offline during memory flush.  It continues to be
>>> available for reads and writes during this time.  And see J-D response for
>>> incorrect understanding of how loading of regions is done in an hbase
>>> cluster.
>>> 
>>> 
>>> 
>>> ...
>>> 
>>> 
>>>> I think either I am leaving out code that is required to determine which
>>>> RegionServers are available OR I am keeping too many hBase objects in RAM
>>>> instead of calling their constructors each time (my purpose obviously was
>>>> to
>>>> improve performance).
>>>> 
>>>> 
>>> At the least, keep a single instance of HBaseConfiguration and use it when
>>> constructing all HTable and HBaseAdmin instances.
>>> 
>>> 
>>> 
>>>> Currently the live system is inserting over 7 Million records per day
>>>> (mostly between 8AM and 10PM) which is not a ridiculously high load.
>>>> 
>>>> 
>>> What size are the records?   What is your table schema?  How many regions do
>>> you currently have in your table?
>>> 
>>>  St.Ack