hadoop-hdfs-user mailing list archives

From Jeffrey Buell <jbu...@vmware.com>
Subject Re: Difference between HDFS_BYTES_READ and the actual size of input files
Date Wed, 06 Mar 2013 19:12:16 GMT

Probably because records are split across block boundaries, so some of the data has to be read twice.
Assuming a 64 MB block size and 128 GB of data, I'd estimate the overhead at 1 GB for a 1 MB
record size, and 32 GB for a 32 MB record size. Your observed overhead is about 75% of that;
maybe my arithmetic is off, or HDFS has some intelligence that reduces how often records are
split across blocks (say, if a large record needs to be written but little space is left in
the current block; anybody know?). You can test this theory by increasing the HDFS block size
to 128 MB or 256 MB and then re-creating or copying dataset2: the overhead should drop to 1/2
or 1/4 of what it is now.
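[The back-of-the-envelope estimate above can be sketched as follows. This is an editorial illustration, not code from the thread; it assumes the simple model that every internal block boundary lands inside some record and that, on average, half of that record has to be re-read from the next block.]

```python
# Rough model of the re-read overhead described above: each internal block
# boundary splits a record, and on average half a record is read twice.
# All sizes are in bytes; the function name is illustrative only.

def estimated_reread_overhead(data_size, block_size, record_size):
    """Estimated extra bytes read because records straddle block boundaries."""
    boundaries = data_size // block_size - 1   # internal block boundaries
    return boundaries * record_size // 2       # ~half a record re-read each

GB = 1024 ** 3
MB = 1024 ** 2

# 128 GB of data in 64 MB blocks, as in the thread:
print(estimated_reread_overhead(128 * GB, 64 * MB, 1 * MB) / GB)   # ~1 GB
print(estimated_reread_overhead(128 * GB, 64 * MB, 32 * MB) / GB)  # ~32 GB
```

Under this model the predicted dataset2 overhead is ~32 GB, while the counters in the original message imply 152.00 - 128 = 24 GB, i.e. the ~75% figure quoted above. Doubling the block size halves the number of boundaries, hence the suggested experiment.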


----- Original Message -----

From: "Jeff LI" <uniquejeff@gmail.com> 
To: user@hadoop.apache.org 
Sent: Wednesday, March 6, 2013 10:21:09 AM 
Subject: Difference between HDFS_BYTES_READ and the actual size of input files 

Dear Hadoop Users, 

I recently noticed there is a difference between the file system counter "HDFS_BYTES_READ"
and the actual size of the input files in map-reduce jobs, and the difference seems to
increase as the size of each key/value pair increases.

For example, I'm running the same job on two datasets. Both datasets are the same size, about
128 GB, and the keys are integers. The difference between the two datasets is the number of
key/value pairs and thus the size of each value: dataset1 has 2^17 key/value pairs with 1 MB
per value; dataset2 has 2^12 key/value pairs with 32 MB per value.
The HDFS_BYTES_READ counter is 128.77 GB for dataset1 and 152.00 GB for dataset2.
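[For concreteness, the extra bytes implied by those counters work out as below; this is an editorial check of the arithmetic, not part of the original message.]

```python
# Overhead implied by the HDFS_BYTES_READ counters quoted above.
data_gb = 128.0  # actual input size of each dataset, in GB

for name, counter_gb in [("dataset1", 128.77), ("dataset2", 152.00)]:
    extra = counter_gb - data_gb
    print(f"{name}: {extra:.2f} GB extra read ({extra / data_gb:.1%})")
```

So dataset1 reads only ~0.6% more than the input size, while dataset2 reads ~18.8% more, which is what the question below is asking about.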

I have also tested other value sizes. There doesn't seem to be any difference when each value
is small (<= 128 KB), but the difference becomes noticeable as the value size increases.

Could you give me some idea of why this is happening?

By the way, I'm running Hadoop 0.20.2-cdh3u5. 


