hadoop-common-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: How is hadoop going to handle the next generation disks?
Date Fri, 08 Apr 2011 17:59:09 GMT
On Fri, Apr 8, 2011 at 12:24 PM, sridhar basam <sri@basam.org> wrote:
> BTW this is on systems which have a lot of RAM and aren't under high load.
> If you find that your system is evicting dentries/inodes from its cache, you
> might want to experiment with dropping vm.vfs_cache_pressure from its default
> so that they are preferred over the pagecache. At the extreme, setting it
> to 0 means they are never evicted.
>  Sridhar
> On Fri, Apr 8, 2011 at 11:37 AM, sridhar basam <sri@basam.org> wrote:
>> How many files do you have per node? What I find is that most of my
>> inodes/dentries are almost always cached, so even on a host with hundreds
>> of thousands of files a 'du -sk' generally uses high i/o for only a couple
>> of seconds. I am using 2TB disks too.
>>  Sridhar
>> On Fri, Apr 8, 2011 at 12:15 AM, Edward Capriolo <edlinuxguru@gmail.com>
>> wrote:
>>> I have a 0.20.2 cluster. I notice that our nodes with 2 TB disks waste
>>> tons of disk I/O doing a 'du -sk' of each data directory. Instead of
>>> 'du -sk', why not just do this with java.io.File? How is this going to
>>> work with 4 TB, 8 TB disks and up? It seems like calculating used and
>>> free disk space could be done a better way.
>>> Edward

Right. Most inodes are always cached when:

1) the disks are small
2) the load is light.

But that is not the case with Hadoop.
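For anyone who does want to try Sridhar's vfs_cache_pressure suggestion on a
lightly loaded box, the knob lives in procfs; this is just a sketch of checking
and changing it (the value 10 is an illustration, not a recommendation):

```shell
# Current value (the kernel default is 100)
cat /proc/sys/vm/vfs_cache_pressure

# Prefer keeping dentries/inodes over pagecache (needs root; 0 is the
# extreme Sridhar mentions -- they are then never reclaimed)
sysctl -w vm.vfs_cache_pressure=10
```

To make it survive a reboot, the same setting would go in /etc/sysctl.conf.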

Making the problem worse: Hadoop seems to issue 'du -sk' for all disks at
the same time. This pulverizes the cache.
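Concretely, the access pattern amounts to something like the following (temp
directories stand in for real dfs.data.dir mounts; the loop is illustrative,
not Hadoop's actual code):

```shell
# Create four stand-in data directories, each with one fake block file.
dirs=""
for i in 1 2 3 4; do
  d=$(mktemp -d)
  dirs="$dirs $d"
  dd if=/dev/zero of="$d/blk_$i" bs=1024 count=16 2>/dev/null
done

# All the walks start together, competing for the dentry/inode cache --
# this is the simultaneous-du behavior described above.
for d in $dirs; do
  du -sk "$d" &
done
wait

rm -rf $dirs
```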

All this to calculate a size that is typically within 0.01% of what a
df estimate would tell us.
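The difference in cost is easy to see on any box; here is a quick,
self-contained comparison (the temp dir and block file name are made up,
standing in for a data directory):

```shell
dir=$(mktemp -d)                               # stand-in for a DataNode data dir
dd if=/dev/zero of="$dir/blk_0001" bs=1024 count=64 2>/dev/null

du -sk "$dir"   # walks every inode in the tree: metadata I/O grows with file count
df -k "$dir"    # one statfs() call: constant time regardless of file count

rm -rf "$dir"
```

du answers "how much does this subtree use" while df answers "how full is
this filesystem"; on a mount dedicated to datanode storage the two track
each other closely, which is the point above.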
