hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Large data sets
Date Tue, 06 Feb 2007 20:01:39 GMT
Konstantin Shvachko wrote:
> 200 bytes per file is theoretically correct, but rather optimistic :-(
> From real system memory utilization I can see that HDFS uses 1.5-2K
> per file. And since each real file is internally represented by two
> files (1 real + 1 crc), the real per-file estimate should be 3-4K.

But also note that there are plans to address these overheads over the 
coming months.  For a start:

https://issues.apache.org/jira/browse/HADOOP-803
https://issues.apache.org/jira/browse/HADOOP-928

Once checksums are optional, we can replace their implementation in 
HDFS with one that does not consume namespace.

Long term we hope to approach ~100 bytes per file.
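
To put those figures in perspective, here is a rough back-of-the-envelope
sketch in Java.  The per-object costs, the two-objects-per-file factor and
the ~100-byte target are the numbers from this thread; the 10-million-file
count is only an illustrative assumption.

// Back-of-the-envelope namenode heap estimate using the figures quoted
// in this thread.  The constants are rough thread numbers, not
// measurements; the file count is an arbitrary example.
public class NamenodeHeapEstimate {

    // Observed cost per namespace object (bytes), per Konstantin's estimate.
    static final long LOW_BYTES_PER_OBJECT  = 1500L;
    static final long HIGH_BYTES_PER_OBJECT = 2000L;

    // Each user file currently creates two namespace objects:
    // the data file and its hidden .crc checksum file.
    static final long OBJECTS_PER_USER_FILE = 2L;

    // Long-term target mentioned above (bytes per file).
    static final long TARGET_BYTES_PER_FILE = 100L;

    public static void main(String[] args) {
        long files = 10000000L;  // example: 10 million user files

        long lowHeap    = files * OBJECTS_PER_USER_FILE * LOW_BYTES_PER_OBJECT;
        long highHeap   = files * OBJECTS_PER_USER_FILE * HIGH_BYTES_PER_OBJECT;
        long targetHeap = files * TARGET_BYTES_PER_FILE;

        System.out.printf("Current estimate: %,d - %,d MB%n",
                lowHeap / (1 << 20), highHeap / (1 << 20));
        System.out.printf("Long-term target: %,d MB%n",
                targetHeap / (1 << 20));
    }
}

At 10 million files this works out to roughly 28-38 GB of namenode heap
with today's per-file cost, versus about 1 GB at the ~100-byte target.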

Doug
