hadoop-common-user mailing list archives

From "Igor Bolotin" <ig...@collarity.com>
Subject RE: DataNode stops cleaning disk?
Date Thu, 05 Mar 2009 19:46:22 GMT
That's what I saw just yesterday on one of the data nodes with this
situation (will confirm also next time it happens):
- Tmp and current were either empty or almost empty last time I checked.
- du on the entire data directory exactly matched the used space reported
in the NameNode web UI, and the UI did report that the node uses most of
the available disk space.
- nothing else was using disk space (actually - it's dedicated DFS

Thank you for help!

-----Original Message-----
From: Raghu Angadi [mailto:rangadi@yahoo-inc.com] 
Sent: Thursday, March 05, 2009 11:05 AM
To: core-user@hadoop.apache.org
Subject: Re: DataNode stops cleaning disk?

This is unexpected unless some other process is eating up space.

Couple of things to collect next time (along with log):

  - All the contents under datanode-directory/ (especially including 
'tmp' and 'current')
  - Does 'du' of this directory match what is reported to NameNode 
(shown on webui) by this DataNode?
  - Is there anything else taking disk space on the machine?
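
The first two checks above can be scripted so the numbers are captured at the moment the problem occurs. The following is only a rough sketch, not anything shipped with Hadoop: the function names are made up for illustration, the data directory path and the reported figure must be supplied by hand (the reported value is simply read off the NameNode web UI), and it sums apparent file sizes, so it can differ slightly from what 'du' prints in disk blocks.

```python
import os


def dir_usage_bytes(root):
    """Sum apparent file sizes under root (symlinked dirs are not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.lstat(path).st_size
            except OSError:
                # The file vanished between listing and stat; skip it.
                pass
    return total


def usage_by_subdir(data_dir):
    """Return {subdirectory name: bytes used}, e.g. for 'tmp' and 'current'."""
    result = {}
    for entry in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, entry)
        if os.path.isdir(path) and not os.path.islink(path):
            result[entry] = dir_usage_bytes(path)
    return result


def matches_reported(measured, reported, tolerance=0.05):
    """True if measured is within the given relative tolerance of reported.

    'reported' is the used-space figure copied manually from the NameNode
    web UI; the 5% default tolerance is an arbitrary illustrative choice.
    """
    if reported == 0:
        return measured == 0
    return abs(measured - reported) / reported <= tolerance
```

For example, calling usage_by_subdir() on the DataNode's data directory shows how much sits under 'tmp' versus 'current', and matches_reported(dir_usage_bytes(data_dir), reported_bytes) answers the second question once the reported number has been converted to bytes.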


Igor Bolotin wrote:
> Normally I dislike writing about problems without being able to provide
> some more information, but unfortunately in this case I just can't find
> anything.
> Here is the situation - DFS cluster running Hadoop version 0.19.0. The
> cluster is running on multiple servers with practically identical
> hardware. Everything works perfectly well, except for one thing - from
> time to time one of the data nodes (every time it's a different node)
> starts to consume more and more disk space. The node keeps going and if
> we don't do anything - it runs out of space completely (ignoring 20GB
> reserved space settings). Once restarted - it cleans disk rapidly and
> goes back to approximately the same utilization as the rest of data
> nodes in the cluster.
> Scanning datanodes and namenode logs and comparing thread dumps
> from nodes experiencing problem and those that run normally didn't
> produce any clues. Running balancer tool didn't help at all. FSCK shows
> that everything is healthy and number of over-replicated blocks is not
> significant.
> To me - it just looks like at some point the data node stops cleaning
> invalidated/deleted blocks, but keeps reporting space consumed by
> blocks as "not used", but I'm not familiar enough with the internals and
> just plain don't have enough free time to start digging deeper.
> Does anyone have an idea what is wrong, what else we can do to find out
> what's wrong, or maybe where to start looking in the code?
> Thanks,
> Igor
