hadoop-common-user mailing list archives

From Raghu Angadi <rang...@yahoo-inc.com>
Subject Re: DataNode stops cleaning disk?
Date Thu, 05 Mar 2009 20:06:03 GMT
Igor Bolotin wrote:
> That's what I saw just yesterday on one of the data nodes with this
> situation (will confirm also next time it happens):
> - Tmp and current were either empty or almost empty last time I checked.
> - du on the entire data directory matched exactly with reported used
> space in NameNode web UI and it did report that it uses most of the
> available disk space.
> - nothing else was using disk space (actually - it's dedicated DFS
> cluster).

If the 'du' command (which you can run in the shell) counts properly, then
you should be able to see which files are taking up space.

If 'du' can't account for the space but 'df' reports very little space
available, then it is possible (though I have never seen it) that the
datanode is keeping a lot of deleted files open. Space held by such files
shows up in 'df' but not in 'du'. 'ls -l /proc/<datanodepid>/fd' lists
these files. If it is not the datanode, then check lsof to find out who is
holding these files.
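
For what it's worth, the /proc check can be scripted. This is only a
minimal sketch in Python, assuming a Linux machine where the kernel appends
" (deleted)" to the link target of an open file that has been unlinked. The
function name and the proc_root parameter are made up for illustration and
are not part of Hadoop.

```python
import os


def deleted_open_files(pid, proc_root="/proc"):
    """List (fd, link target, size in bytes or None) for deleted files
    that the given process still holds open.

    proc_root is a hypothetical parameter, present only so the function
    can be pointed at a directory other than the real /proc.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    found = []
    # Entries under /proc/<pid>/fd are named by descriptor number.
    for fd in sorted(os.listdir(fd_dir), key=int):
        link = os.path.join(fd_dir, fd)
        try:
            target = os.readlink(link)
        except OSError:
            continue  # descriptor was closed while we were scanning
        if not target.endswith(" (deleted)"):
            continue
        try:
            # On a live /proc, stat follows the link to the still-open inode.
            size = os.stat(link).st_size
        except OSError:
            size = None  # size not available
        found.append((int(fd), target, size))
    return found
```

Call it with the datanode's pid, as the same user as the datanode or as
root, and add up the sizes. If the total is close to the gap between what
'df' and 'du' report, the datanode is the one holding the space.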

Hope this helps.

> Thank you for help!
> Igor
> -----Original Message-----
> From: Raghu Angadi [mailto:rangadi@yahoo-inc.com] 
> Sent: Thursday, March 05, 2009 11:05 AM
> To: core-user@hadoop.apache.org
> Subject: Re: DataNode stops cleaning disk?
> This is unexpected unless some other process is eating up space.
> Couple of things to collect next time (along with log):
>   - All the contents under datanode-directory/ (especially including 
> 'tmp' and 'current')
>   - Does 'du' of this directory match what is reported to NameNode
> (shown on webui) by this DataNode?
>   - Is there anything else taking disk space on the machine?
> Raghu.
> Igor Bolotin wrote:
>> Normally I dislike writing about problems without being able to provide
>> some more information, but unfortunately in this case I just can't find
>> anything.
>> Here is the situation - DFS cluster running Hadoop version 0.19.0. The
>> cluster is running on multiple servers with practically identical
>> hardware. Everything works perfectly well, except for one thing - from
>> time to time one of the data nodes (every time it's a different node)
>> starts to consume more and more disk space. The node keeps going and if
>> we don't do anything - it runs out of space completely (ignoring 20GB
>> reserved space settings). Once restarted - it cleans disk rapidly and
>> goes back to approximately the same utilization as the rest of data
>> nodes in the cluster.
>> Scanning datanodes and namenode logs and comparing thread dumps (stacks)
>> from nodes experiencing problem and those that run normally didn't
>> produce any clues. Running balancer tool didn't help at all. FSCK shows
>> that everything is healthy and number of over-replicated blocks is not
>> significant.
>> To me - it just looks like at some point the data node stops cleaning
>> invalidated/deleted blocks, but keeps reporting space consumed by these
>> blocks as "not used", but I'm not familiar enough with the internals and
>> just plain don't have enough free time to start digging deeper.
>> Does anyone have an idea what is wrong or what else we can do to find out
>> what's wrong or maybe where to start looking in the code?
>> Thanks,
>> Igor
