hbase-user mailing list archives

From Andrew Purtell <apurt...@yahoo.com>
Subject Re: Cascading failure leads to loss of all region servers
Date Thu, 12 Apr 2012 05:58:44 GMT
One idea we took from the 0.89-FB branch is setting the internal scanner read batching for
compaction (compactionKVMax) to 1, since a larger batch gives no server-side benefit for
compaction, and we run with heaps sometimes up at 90% utilization for a time, as observed
with JMX. I wonder if that would have had an impact here. Just a random thought; pardon if
the default is already 1 (IIRC it's 10) or something silly like that.
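For reference, the batching Andy describes is controlled by the `hbase.hstore.compaction.kv.max` setting (default 10, matching his recollection). A sketch of the override in `hbase-site.xml`; the value of 1 mirrors the 0.89-FB practice he mentions and is a suggestion, not a verified fix for this incident:

```xml
<!-- hbase-site.xml: number of KVs read per internal scanner batch
     during compaction; 1 minimizes per-batch heap pressure -->
<property>
  <name>hbase.hstore.compaction.kv.max</name>
  <value>1</value>
</property>
```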

Best regards,

    - Andy

On Apr 11, 2012, at 6:17 PM, Bryan Beaudreault <bbeaudreault@hubspot.com> wrote:

> Hi Stack,
> Thanks for the reply.  Unfortunately, our first instinct was to restart the
> region servers, and when they came back up the compaction appears to have
> succeeded (perhaps because on a fresh restart the heap was low enough).  I
> listed the files under that region and there is now only one file.
> We are going to be running this job again in the near future.  We are going
> to try to rate limit the writes a bit (though only 10 reducers were running
> at once to begin with), and I will keep in mind your suggestions if it
> happens despite that.
> - Bryan
> On Wed, Apr 11, 2012 at 4:35 PM, Stack <stack@duboce.net> wrote:
>> On Wed, Apr 11, 2012 at 10:24 AM, Bryan Beaudreault
>> <bbeaudreault@hubspot.com> wrote:
>>> We have 16 m1.xlarge ec2 machines as region servers, running cdh3u2,
>>> hosting about 17k regions.
>> That's too many, but that's another story.
>>> That pattern repeats on all of the region servers, every 5-8 minutes
>> until
>>> all are down. Should there be some safeguards on a compaction causing a
>>> region server to go OOM?  The region appears to only be around 425mb in
>>> size.
>> My guess is that Region A has a massive or corrupt record in it.
>> You could disable the region for now while you're figuring out what's
>> wrong with it.
>> If you list files under this region, what do you see?  Are there many?
>> Can you see what files are selected for compaction?  This will narrow
>> the set to look at.  You could poke at them w/ the hfile tool.  See
>> 'HFile Tool' in the reference guide.
>> St.Ack
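
For reference, Stack's suggestions map to commands along these lines. The paths below are illustrative placeholders (substitute your actual table and encoded region name); the HFile tool invocation is the one documented in the HBase reference guide's 'HFile Tool' section, and both assume a running cluster:

```shell
# List the files under the suspect region to see how many store files
# exist and how large they are (path components are hypothetical)
hadoop fs -ls /hbase/mytable/<encoded-region-name>/<family>

# Dump metadata (-m) and verbosely scan (-v) a suspect store file with
# the HFile tool to look for a massive or corrupt record
hbase org.apache.hadoop.hbase.io.hfile.HFile -v -m -f \
  /hbase/mytable/<encoded-region-name>/<family>/<storefile>
```

A single oversized KeyValue usually stands out in the `-m` metadata output via the file's max key/value lengths before you resort to a full `-p` key/value dump.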
