hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: Cascading failure leads to loss of all region servers
Date Wed, 11 Apr 2012 20:35:40 GMT
On Wed, Apr 11, 2012 at 10:24 AM, Bryan Beaudreault
<bbeaudreault@hubspot.com> wrote:
> We have 16 m1.xlarge ec2 machines as region servers, running cdh3u2,
> hosting about 17k regions.

That's too many, but that's another story.

> That pattern repeats on all of the region servers, every 5-8 minutes until
> all are down. Should there be some safeguards on a compaction causing a
> region server to go OOM?  The region appears to only be around 425mb in
> size.

My guess is that Region A has a massive or corrupt record in it.

You could disable the region for now while you figure out what's wrong with it.

If you list files under this region, what do you see?  Are there many?

Can you see which files are selected for compaction?  That will narrow
the set to look at.  You could poke at them with the HFile tool.  See
'HFile Tool' in the reference guide.
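
A rough sketch of what I mean, with assumptions flagged: it assumes the
usual layout of rootdir/table/encoded-region/family/storefile under
/hbase, and the table, region, family, and file names are placeholders,
not values from your cluster.  The helpers only print the commands so
you can check them before running anything.  The -v, -m, and -f flags
are as I recall them; run the tool with no arguments to confirm the
usage on your version.

```shell
#!/bin/sh
# Dry run: these helpers only PRINT commands; review them, then run by hand.
# Assumed layout (not confirmed in this thread):
#   <rootdir>/<table>/<encoded-region>/<family>/<storefile>

# Print the command that recursively lists a region's store files.
# Args: rootdir, table, encoded region name.
list_region_cmd() {
  printf 'hadoop fs -lsr %s/%s/%s\n' "$1" "$2" "$3"
}

# Print the command that dumps one store file's metadata with the HFile tool.
# Arg: full path to the store file.
hfile_meta_cmd() {
  printf 'hbase org.apache.hadoop.hbase.io.hfile.HFile -v -m -f %s\n' "$1"
}

# Hypothetical values; substitute your own.
list_region_cmd /hbase mytable ENCODED_REGION
hfile_meta_cmd /hbase/mytable/ENCODED_REGION/FAMILY/STOREFILE
```

Start with the files the compaction log names.  An unusually large file,
or one the tool chokes on, is the first suspect.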

