hadoop-common-user mailing list archives

From: sridhar basam <...@basam.org>
Subject: Re: How is hadoop going to handle the next generation disks?
Date: Fri, 08 Apr 2011 18:51:01 GMT
On Fri, Apr 8, 2011 at 1:59 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:

>
> Right. Most inodes are always cached when:
>
> 1) disks are small
> 2) load is light.
>

But that is not the case with Hadoop.
>
> Making the problem worse:
> It seems like Hadoop issues a 'du -sk' for every disk at the same time.
> This pulverises the cache.
>
> All this to calculate a size that is typically within .01% of what a
> df estimate would tell us.
>

Don't know your setup, but I think this is manageable in the short to medium
term. Even with a 20TB node you are likely looking at well under a million
files, depending on your configuration and usage. I would much rather blow
500MB-1GB on keeping those entries in RAM than rely on the page cache, where
most of it probably ends up hitting the disks anyway.

The one case where I think the du is needed is when people haven't dedicated
the entire space on a drive to Hadoop. Using df in that case wouldn't
accurately reflect usage.
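
To make that df/du trade-off concrete, a minimal sketch (not DataNode code,
and the data directory path is made up): the df-style number comes from the
whole partition, so anything else sharing the drive gets counted as Hadoop
usage, while the du-style walk only totals the data directory itself, at the
cost of statting every block file.

    import java.io.File;

    public class DfVsDu {
        // df-style: a single filesystem call, essentially free, but it reports
        // the whole partition, so non-Hadoop files on a shared drive are included.
        static long dfStyleUsed(File dir) {
            return dir.getTotalSpace() - dir.getFreeSpace();
        }

        // du-style: walk the tree and total file lengths; counts only the data
        // directory, but stats every block file along the way.
        static long duStyleUsed(File f) {
            if (!f.isDirectory()) return f.length();
            long total = 0;
            File[] children = f.listFiles();
            if (children != null) {
                for (File c : children) total += duStyleUsed(c);
            }
            return total;
        }

        public static void main(String[] args) {
            File dataDir = new File("/data/1/hdfs");  // hypothetical mount point
            System.out.println("df-style bytes used: " + dfStyleUsed(dataDir));
            System.out.println("du-style bytes used: " + duStyleUsed(dataDir));
        }
    }

On a drive dedicated to Hadoop the two numbers track each other closely,
which is why df is tempting; they only really diverge when something else
shares the partition.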

 Sridhar
