hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: web interface is fragile?
Date Thu, 01 Apr 2010 01:29:41 GMT
The fact that we see the exception 10 times means that
getRegionServerWithRetries got that error 10 times before
giving up... Are you sure you don't see it in the region server's
log located at



On Wed, Mar 31, 2010 at 6:26 PM, Buttler, David <buttler1@llnl.gov> wrote:
> Hi J-D,
> Thanks for taking a look at this.  The error that I received is:
> http://pastebin.com/ZnhVA5B0
> This is the client side.
> A little strange, as I have run this task several times in the past and my
> client heap size is set to 4GB. I can try doubling it and see if that helps.
> Dave
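[For reference, a minimal sketch of the two places the client heap is typically raised in a case like this; the 8 GB value and the file paths are illustrative assumptions, not settings taken from this thread:]

```shell
# conf/hbase-env.sh -- heap for processes started via the hbase scripts.
# In this era of HBase the value is in MB; 8000 (~8 GB) is an assumed
# example, doubling the 4 GB mentioned above.
export HBASE_HEAPSIZE=8000

# For an MR job, the map/reduce tasks (the HBase clients here) get their
# heap from the Hadoop job configuration instead, e.g. passed on the
# command line (illustrative):
#   hadoop jar myjob.jar -Dmapred.child.java.opts=-Xmx8g ...
```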
> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
> Sent: Wednesday, March 31, 2010 6:11 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: web interface is fragile?
> Dave,
> Can you pastebin the exact error that was returned by the MR job? That
> looks like it's client-side (from HBase's point of view).
> WRT .META. and the master: the web page does a request on every
> hit, so if the region is unavailable you can't see it. It looks like
> you kill -9'ed the region server? If so, it takes about a minute to detect
> the region server failure and then split the write-ahead logs, so if
> .META. was on that machine, it will take that long to get a
> working web page.
> Instead of kill -9, simply go to the node and run
> ./bin/hbase-daemon.sh stop regionserver
> J-D
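[The graceful-stop advice above can be sketched as a rolling-restart loop. This is a dry-run sketch that only prints the commands it would run; the host names and the HBASE_HOME path are assumptions, not values from this thread:]

```shell
#!/bin/sh
# Dry-run sketch of a rolling restart: for each region server, stop it
# gracefully (instead of kill -9) and start it again, one node at a time.
# It only echoes the commands; drop the echo to actually run them via ssh.
HBASE_HOME=${HBASE_HOME:-/opt/hbase}          # assumed install path
REGION_SERVERS=${REGION_SERVERS:-"rs1 rs2"}   # assumed host names

rolling_restart() {
  for host in $REGION_SERVERS; do
    echo "ssh $host $HBASE_HOME/bin/hbase-daemon.sh stop regionserver"
    echo "ssh $host $HBASE_HOME/bin/hbase-daemon.sh start regionserver"
  done
}

rolling_restart
```

Stopping one server at a time lets the master reassign its regions before the next node goes down, which avoids the .META. outage described below.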
> On Wed, Mar 31, 2010 at 5:51 PM, Buttler, David <buttler1@llnl.gov> wrote:
>> Hi,
>> I have a small cluster (6 nodes: 1 master and 5 region server/data nodes). Each
>> node has plenty of memory and disk: 16GB of heap dedicated to each RegionServer
>> and 4 TB of disk per node for HDFS.
>> I have a table with about 1 million rows in HBase - that's all. Currently it is
>> split across 50 regions.
>> I was monitoring this with the HBase web GUI and I noticed that a lot of the heap
>> was being used (14GB). I was running an MR job and getting an error on the console
>> that launched the job:
>> Error: GC overhead limit exceeded hbase
>> First question: is this going to hose the whole system? I didn't see the error
>> in any of the HBase logs, so I assume it was purely a client issue.
>> So, naively thinking that maybe the GC had moved everything to permgen and just
>> wasn't cleaning up, I thought I would do a rolling restart of my region servers and
>> see if that cleared everything up. The first server I killed happened to be the one
>> hosting the .META. table. Subsequently the web GUI failed. Looking at the errors, it
>> seems that the web GUI essentially caches the address of the .META. table and blindly
>> tries connecting on every request. I suppose I could restart the master, but this
>> does not seem like desirable behavior. Shouldn't the cache be refreshed on error?
>> And since there is no real code for the GUI, just a JSP page, doesn't this mean this
>> behavior could be seen in other applications that use HMaster?
>> Corrections welcome
>> Dave
