hbase-user mailing list archives

From Seraph Imalia <ser...@eisp.co.za>
Subject Re: Hbase pausing problems
Date Mon, 18 Jan 2010 09:03:46 GMT
Answers below...

Regards,
Seraph

> From: stack <stack@duboce.net>
> Reply-To: <hbase-user@hadoop.apache.org>
> Date: Fri, 15 Jan 2010 10:10:39 -0800
> To: <hbase-user@hadoop.apache.org>
> Subject: Re: Hbase pausing problems
> 
> How many CPUs?

1x Quad Xeon in each server

> 
> You are using default JVM settings (see HBASE_OPTS in hbase-env.sh).  You
> might want to enable GC logging.  See the line after hbase-env.sh.  Enable
> it.  GC logging might tell you about the pauses you are seeing.

I will enable GC Logging tonight during our slow time because restarting the
regionservers causes the clients to pause indefinitely.
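
In the meantime, the client-side stalls can be timestamped so that they can be lined up against the GC log once it is enabled. A minimal sketch in Java, assuming the insert can be wrapped in a Runnable; PauseTimer, the 5-second threshold and the empty placeholder insert are illustrative only and are not part of HBase or the existing client code:

```java
import java.util.Date;
import java.util.concurrent.TimeUnit;

public class PauseTimer {
    // Runs `op` and returns how long it took, in milliseconds.
    public static long timeMillis(Runnable op) {
        long start = System.nanoTime();
        op.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    // True when a call took at least `thresholdMillis` (hypothetical cut-off).
    public static boolean isPause(long elapsedMillis, long thresholdMillis) {
        return elapsedMillis >= thresholdMillis;
    }

    public static void main(String[] args) {
        // Placeholder: the real HBase insert would go here.
        Runnable insert = () -> { };
        long elapsed = timeMillis(insert);
        if (isPause(elapsed, 5000)) {
            // The wall-clock timestamp is what gets matched against the GC log.
            System.out.println(new Date() + " insert stalled for " + elapsed + " ms");
        }
    }
}
```

Logging only the calls that exceed the threshold keeps the output small enough to compare by eye with the regionserver and GC logs.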

> 
> Can you get a fourth server for your cluster and run the master, zk, and
> namenode on it and leave the other three servers for regionserver and
> datanode (with perhaps replication == 2 as per J-D to lighten load on small
> cluster).

We plan to double the number of servers in the next few weeks and I will
take your advice to put the master, zk and namenode on one of the new
servers (we will need to have a second one on standby should this one
crash).  The servers will be ordered shortly and will be here in a week or two.

That said, I have been monitoring CPU usage and none of the servers seems
particularly busy.  The regionserver on each one hovers around 30% all the
time and the datanode sits at about 10% most of the time.  If we do have a
resource issue, it definitely does not seem to be CPU.

Increasing RAM did not seem to work either - it just made HBase use a bigger
memstore, which then took longer to flush.
 

> 
> More notes inline in below.
> 
> On Fri, Jan 15, 2010 at 1:33 AM, Seraph Imalia <seraph@eisp.co.za> wrote:
> 
>> Approximately every 10 minutes, our entire coldfusion system pauses at the
>> point of inserting into hBase for between 30 and 60 seconds and then
>> continues.
>> 
> Yeah, enable GC logging.  See if you can make correlation between the pause
> the client is seeing and a GC pause.
> 
> 
> 
> 
>> Investigation...
>> 
>> Watching the logs of the regionserver, the pausing of the coldfusion system
>> happens as soon as one of the regionservers starts flushing the memstore and
>> recovers again as soon as it is finished flushing (recovers as soon as it
>> starts compacting).
>> 
> 
> 
> ...though, this would seem to point to an issue with your hardware.  How
> many disks?  Are they misconfigured such that they hold up the system when
> they are being heavily written to?
> 
> 
> A regionserver log at DEBUG from around this time so we could look at it
> would be helpful.
> 
> 
>> I can recreate the error just by stopping 1 of the regionservers; but then
>> starting the regionserver again does not make coldfusion recover until I
>> restart the coldfusion servers.  It is important to note that if I keep the
>> built in hBase shell running, it is happily able to put and get data to and
>> from hBase whilst coldfusion is busy pausing/failing.
>> 
> 
> This seems odd.  Enable DEBUG for the client-side.  Do you see the shell
> recalibrating, finding new locations for regions after you shut down the
> single regionserver, something that your coldfusion is not doing?  Or,
> maybe, the shell is putting to a regionserver that has not been disturbed by
> your start/stop?
> 
> 
>> 
>> I have tried increasing the regionserver's RAM to 3 Gigs and this just made
>> the problem worse because it took longer for the regionservers to flush the
>> memory store.
> 
> 
> Again, if flushing is holding up the machine, if you can't write a file in
> background without it freezing your machine, then your machines are anemic
> or misconfigured?
> 
> 
>> One of the links I found on your site mentioned increasing
>> the default value for hbase.regionserver.handler.count to 100 - this did not
>> seem to make any difference.
> 
> 
> Leave this configuration in place I'd say.
> 
> Are you seeing 'blocking' messages in the regionserver logs?  Regionserver
> will stop taking on writes if it thinks it's being overrun to prevent itself
> OOME'ing.  Grep the 'multiplier' configuration in hbase-default.xml.
> 
> 
> 
>> I have double checked that the memory flush
>> very rarely happens on more than 1 regionserver at a time - in fact in my
>> many hours of staring at tails of logs, it only happened once where two
>> regionservers flushed at the same time.
>> 
> You've enabled DEBUG?
> 
> 
> 
>> My investigations point strongly towards a coding problem on our side rather
>> than a problem with the server setup or hBase itself.
> 
> 
> If things were slow from client-perspective, that might be a client-side
> coding problem but these pauses, unless you have a fly-by deadlock in your
> client-code, it's probably an hbase issue.
> 
> 
> 
>>  I say this because
>> whilst I understand why a regionserver would go offline during a memory
>> flush, I would expect the other two regionservers to pick up the load -
>> especially since the built-in hbase shell has no problem accessing hBase
>> whilst a regionserver is busy doing a memstore flush.
>> 
> HBase does not go offline during memory flush.  It continues to be
> available for reads and writes during this time.  And see J-D response for
> incorrect understanding of how loading of regions is done in an hbase
> cluster.
> 
> 
> 
> ...
> 
> 
>> I think either I am leaving out code that is required to determine which
>> RegionServers are available OR I am keeping too many hBase objects in RAM
>> instead of calling their constructors each time (my purpose obviously was to
>> improve performance).
>> 
>> 
> For sure keep a single instance of HBaseConfiguration at least and use this
> when constructing all HTable and HBaseAdmin instances.
> 
> 
> 
>> Currently the live system is inserting over 7 Million records per day
>> (mostly between 8AM and 10PM) which is not a ridiculously high load.
>> 
>> 
> What size are the records?   What is your table schema?  How many regions do
> you currently have in your table?
> 
>  St.Ack
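
For scale, the load quoted above (over 7 million records per day, mostly between 8AM and 10PM) works out to roughly 139 inserts per second if it is spread evenly over that 14-hour window. A quick sketch of the arithmetic in Java; the class and method names are illustrative, and the even spread is an assumption, so real peaks will be higher:

```java
public class InsertRate {
    // Average inserts per second if `recordsPerDay` arrive evenly over `activeHours`.
    public static double perSecond(long recordsPerDay, int activeHours) {
        return recordsPerDay / (activeHours * 3600.0);
    }

    public static void main(String[] args) {
        // 7 million records between 8AM and 10PM (14 hours), per the figures above.
        System.out.printf("%.1f inserts/sec%n", perSecond(7_000_000L, 14));
    }
}
```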




