hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: Probably lack-of-HADOOP-1700 causing DATA LOSS
Date Thu, 22 Jan 2009 20:10:03 GMT
Genady Gillin wrote:
> Hi,
> Sorry for my lack of experience, but what do you mean by I/O character, nicing
> and giving preference to the region servers? We're running Hadoop on JDK 1.6.11

Try 'man nice' and 'man ionice'.  The basic point is that processes can be 
starved of resources other than CPU.
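For example, a sketch against a throwaway `sleep` process (on a real cluster you would point these commands at the datanode and regionserver PIDs from `jps`; `ionice` needs Linux with an I/O scheduler that honors priorities, such as CFQ):

```shell
# Stand-in process for the demo; substitute the datanode/regionserver PID.
sleep 60 &
pid=$!

# CPU priority: nice values run from -20 (most favored) to 19 (least).
# Raising priority (negative values) needs root, so this demo only lowers it.
renice -n 5 -p "$pid"

# I/O priority: best-effort class (-c 2) at its highest level (-n 0), so the
# process gets first crack at the disk when it does run (Linux only).
command -v ionice >/dev/null && ionice -c 2 -n 0 -p "$pid"

# Confirm the new nice value.
ps -o ni= -p "$pid"

kill "$pid"
```

On a loaded box, giving the datanode and regionserver a lower nice value than everything else (and a better I/O class) is a cheap experiment: if the 'slept too long' warnings stop, starvation was the problem.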

Your JVM looks good.

Do you run ganglia or anything so you can see what else is happening on 
these machines when you see '... slept too long..' in the hbase logs?

It might be that lots of threads are up in the datanodes, using memory 
outside the datanode heap.
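A quick way to sanity-check that theory (Linux; the datanode PID would come from `jps` -- here the current shell stands in):

```shell
# Count the lightweight processes (threads) of a given PID.  Each Java
# thread reserves its own stack outside the -Xmx heap, so thousands of
# datanode threads translate into real memory pressure.
# $$ (this shell) is a stand-in for the datanode PID from `jps`.
ps -o nlwp= -p $$
```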

Do you have dfs.datanode.socket.write.timeout set to zero on your 
cluster?  Since you are using Hadoop 0.19.0, you might let it be the 
default instead (with the default, unused threads and sockets are released 
by the datanode -- the client will reestablish them behind the scenes if it 
needs access to the file again).
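In hadoop-site.xml that amounts to deleting the zero override entirely, or setting the property back to the default explicitly (480000 ms, i.e. eight minutes, is my recollection of the 0.19 default -- verify against your hadoop-default.xml):

```xml
<!-- hadoop-site.xml: leaving dfs.datanode.socket.write.timeout at its
     default lets the datanode reclaim idle writer threads and sockets
     instead of holding them forever (which a value of 0 forces). -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>480000</value>
</property>
```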


> Gennady
> On Thu, Jan 22, 2009 at 9:18 PM, stack <stack@duboce.net> wrote:
>> Genady Gillin wrote:
>>> Please see my answers below:
>>> *
>>> These are 'classic' failure modes Genady.  Generally its indication that
>>> host is so loaded, hbase + datanode are starved of timeslices.  Later you
>>> say you are swapping.  At time of above, perhaps machine was swapping
>>> hard?
>>>  Anything else on these machines?  Anything in your messages log?*
>>> That's what I thought as well, although I wasn't sure, since CPU on this
>>> host
>>> was about 80% and swap usage about 40-50%. This machine is also running
>>> another program with a constant 30% CPU load.
>> What's the 'load' on the machine?  CPU is one thing, but what's the i/o
>> character like?  For kicks, if you niced the datanodes and regionservers so
>> they have preference -- you might have to do i/o nicing too -- do the issues
>> go away?  What JVM version are you running?
>> St.Ack
