hbase-user mailing list archives

From ramkrishna vasudevan <ramkrishna.s.vasude...@gmail.com>
Subject Re: Struggling with Region Servers Running out of Memory
Date Tue, 30 Oct 2012 06:43:15 GMT
Hi

Are you using any coprocessors? Can you see how many store files are being
created?

The number of blocks getting cached will give you an idea too.
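
If you have JMX open on the region servers you can pull those numbers
directly instead of eyeballing the UI. A minimal sketch, assuming the
0.92-style RegionServerStatistics MBean (the bean and attribute names below
are from memory -- confirm them in jconsole first, they change between
versions):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RegionServerStats {
  public static void main(String[] args) throws Exception {
    // args[0] is host:port of the region server's JMX endpoint
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi");
    JMXConnector jmxc = JMXConnectorFactory.connect(url);
    try {
      MBeanServerConnection conn = jmxc.getMBeanServerConnection();
      // MBean name as exposed by 0.92-era region servers (check in jconsole)
      ObjectName rs = new ObjectName(
          "hadoop:service=RegionServer,name=RegionServerStatistics");
      for (String attr : new String[] {
          "storefiles", "blockCacheCount", "blockCacheSize", "memstoreSizeMB" }) {
        System.out.println(attr + " = " + conn.getAttribute(rs, attr));
      }
    } finally {
      jmxc.close();
    }
  }
}

The region server web UI on port 60030 should show the same numbers in its
metrics line if you'd rather not write code.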

Regards
Ram

On Tue, Oct 30, 2012 at 4:25 AM, Jeff Whiting <jeffw@qualtrics.com> wrote:

> We have 6 region servers, each given 10 GB of memory for HBase.  Each region
> server has an average of about 100 regions, and across the cluster we are
> averaging about 100 requests/second with a pretty even read/write load.  We
> are running CDH4 (0.92.1-cdh4.0.1, rUnknown).
>
> Looking over our load and our request rates, I feel that 10 GB of memory
> should be enough to handle the load and that we shouldn't really be pushing
> the memory limits.
>
> However, what we are actually seeing is that memory usage climbs slowly until
> the region server starts sputtering from long garbage collection pauses;
> eventually its ZooKeeper session times out and it gets killed.
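>
> (Side note: if there is a sane way to buy time while we chase this, raising
> zookeeper.session.timeout in hbase-site.xml seems like the obvious stopgap,
> e.g. something like:
>
>   <property>
>     <name>zookeeper.session.timeout</name>
>     <value>90000</value>
>   </property>
>
> though as I understand it the ZooKeeper quorum's maxSessionTimeout caps
> whatever the client asks for, so that may need raising too.)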
>
> We'll see aborts like this in the log:
> 2012-10-29 08:10:52,132 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547:
> Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException:
> Server REPORT rejected; currently processing ds5.h1.ut1.qprod.net,60020,1351233245547
> as dead server
> 2012-10-29 08:10:52,250 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> RegionServer abort: loaded coprocessors are: []
> 2012-10-29 08:10:52,392 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547:
> regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf
> regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf received
> expired from ZooKeeper, aborting
> 2012-10-29 08:10:52,401 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> RegionServer abort: loaded coprocessors are: []
>
> Which are "caused" by:
> 2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 29014ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2012-10-29 08:08:39,074 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 28121ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2012-10-29 08:09:13,261 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 31124ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2012-10-29 08:09:45,536 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 32209ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2012-10-29 08:10:18,103 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 32557ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2012-10-29 08:10:51,896 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 33741ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
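>
> (To confirm where these pauses come from we're going to turn on full GC
> logging next.  As I understand it the usual additions to hbase-env.sh are
> something like this -- the log path is just our choice:
>
> export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
>     -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-regionserver.log"
>
> That should at least tell us whether these are CMS concurrent mode failures
> falling back to stop-the-world full collections, or something else entirely,
> like swapping.)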
>
>
> We'll also see a bunch of responseTooSlow and operationTooSlow warnings as GC
> kicks in and really kills the region server's performance.
>
>
> We have the JVM metrics going out to Ganglia, and looking at
> jvm.RegionServer.metrics.memHeapUsedM you can see it climb over time until
> the server runs out of memory.  I can also see in hmaster:60010/master-status
> that usedHeapMB just keeps going up, so I can make a pretty educated guess as
> to which server will go down next.  It takes several days to a week of
> continuous running (after restarting a region server) before we have a
> potential problem.
>
> Our next one to go will probably be ds6 and jmap -heap shows:
> concurrent mark-sweep generation:
>    capacity = 10398531584 (9916.8125MB)
>    used     = 9036165000 (8617.558479309082MB)
>    free     = 1362366584 (1299.254020690918MB)
>    86.89847145248619% used
>
> So we are using 86% of the 10 GB heap, nearly all of it sitting in the
> concurrent mark-sweep (old) generation.  Looking at ds6 in the web interface,
> the task view doesn't show any compactions or other background tasks running,
> nor are there any active RPC calls that have been open longer than 0 seconds
> (it seems to be handling requests just fine).
>
> At this point I feel somewhat lost as to how to debug the problem and am not
> sure what to do next to figure out what is going on.  Any suggestions as to
> what to look for, or how to track down where the memory is being used?  I can
> generate heap dumps via jmap (although that effectively kills the region
> server), but I don't really know what to look for in them to see where the
> memory is going.  I also have JMX set up on each region server and can
> connect to it that way.
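>
> (Concretely, the jmap invocations I know of are:
>
> jmap -histo:live <pid>                                  # per-class counts/bytes
> jmap -dump:live,format=b,file=/tmp/rs-heap.hprof <pid>  # full binary dump
>
> Both ":live" variants force a full GC first, which is a big part of why the
> dump stalls the region server.  The plan would be to open the .hprof in
> something like Eclipse MAT, but I don't know what a healthy region server
> heap is supposed to look like.)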
>
> Thanks,
> ~Jeff
>
> --
> Jeff Whiting
> Qualtrics Senior Software Engineer
> jeffw@qualtrics.com
>
>
