hbase-user mailing list archives

From Ferdy <ferdy.gal...@kalooga.com>
Subject Re: Regionserver problems because of datanode timeouts
Date Mon, 14 Jun 2010 10:09:36 GMT
After running stably for quite a while (using the long timeouts we had 
configured), we recently noticed regionservers starting to behave badly 
again. During compactions, regionservers complained that blocks were 
unavailable. Every couple of days a regionserver decided to terminate 
itself because it could not recover from the DFS errors.

So, after looking into it again, we may have found the actual cause of 
this problem. Prior to a regionserver terminating, the logs of the 
corresponding datanode told us that the "df" command could not be run 
because it could not allocate memory. Indeed, we had fine-tuned our 
nodes to use nearly all RAM for the Hadoop/HBase and child task 
processes, and we had swap disabled. But we assumed that a simple "df" 
check should not be that expensive, right..?

Well, it seems we had to learn a bit about "Linux memory overcommit". 
Without going into too much detail: spawning a process on Linux requires 
the new process to be able to reserve roughly the same memory footprint 
as the parent, because fork() duplicates the parent's address space 
(copy-on-write, so the pages are rarely actually copied, but they still 
have to be accountable). Therefore a datanode with a 1.6GB heap (in our 
case) needs about that much memory free when spawning a new process, 
even though the spawned process will do little to nothing. To 
accommodate this, you should either have enough free memory available 
(physical / swap) or tweak the 'overcommit' configuration of the 
operating system. We decided to increase the available memory by 
enabling swap files.
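For reference, here is a sketch of both workarounds on a Linux datanode. The /proc paths and sysctl knob are the stock kernel interface; the swap file size is just an example, not what we used:

```shell
# Inspect the current overcommit policy and accounting headroom.
# overcommit_memory: 0 = heuristic, 1 = always allow, 2 = strict accounting.
# Under strict accounting, a fork() from a JVM with a 1.6GB heap needs
# roughly 1.6GB of reservable (not necessarily free) memory.
cat /proc/sys/vm/overcommit_memory
grep -E 'CommitLimit|Committed_AS' /proc/meminfo

# Option 1: add a swap file so fork() has somewhere to account the
# copy-on-write pages (they are rarely touched, only reserved).
# sudo dd if=/dev/zero of=/swapfile bs=1M count=4096
# sudo chmod 600 /swapfile
# sudo mkswap /swapfile && sudo swapon /swapfile

# Option 2: relax overcommit so a fork() from a large JVM is not refused.
# sudo sysctl vm.overcommit_memory=1
# To persist across reboots, add to /etc/sysctl.conf:
#   vm.overcommit_memory = 1
```

Option 2 trades the fork failures for a chance of the OOM killer firing later, which is why we preferred adding swap.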

We're still running Hadoop 0.20.1 and HBase 0.20.3; presumably the 
newer releases have better handling of errors in the 
DFSClient/InputStreams. Nevertheless, we believe we have found the root 
cause of our regionserver problems.


Stack wrote:
> The culprit might be the fragmentation calculation.  See
> https://issues.apache.org/jira/browse/HBASE-2165.
> St.Ack
> On Wed, Mar 10, 2010 at 9:33 AM, Andrew Purtell <apurtell@apache.org> wrote:
>>> However, once and every while our Nagios (our service monitor) detects
>>> that requesting the Hbase master page takes a long time. Sometimes > 10
>>> sec, rarely around 30 secs but most of the time < 10 secs. In the cases
>>> the page loads slowly, there is a fair amount of load on Hbase.
>> I've noticed this also. With 0.20.4-dev. I think others have mentioned it
>> on the list from time to time. However, I can never seem to jump on to a
>> console fast enough to grab a stack dump before the UI becomes responsive
>> again. :-( It is not consistent behavior. It concerns me that perhaps
>> whatever lock is holding up the UI is also holding up any client
>> attempting to (re)locate a region. If I manage to capture it I will file
>> a jira.
>>   - Andy
