hadoop-user mailing list archives

From Jeff LI <uniquej...@gmail.com>
Subject Difference between HDFS_BYTES_READ and the actual size of input files
Date Wed, 06 Mar 2013 18:21:09 GMT
Dear Hadoop Users,

I recently noticed there is a difference between the File System Counter
"HDFS_BYTES_READ" and the actual size of input files in map-reduce jobs.
 And the difference seems to increase as the size of each key,value pair
increases.

For example, I'm running the same job on two datasets.  Both datasets are
the same size, about 128 GB, and the keys are integers.  The datasets
differ in the number of key,value pairs and thus the size of each value:
dataset1 has 2^17 key,value pairs with 1MB per value; dataset2 has 2^12
key,value pairs with 32MB per value.
The HDFS_BYTES_READ counter is 128.77GB for dataset1 and 152.00GB for
dataset2.
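For reference, here is a quick sanity check of the figures above (a minimal sketch; the counter values are simply the ones reported by my jobs):

```python
# Verify that both datasets have the same nominal size, and compute
# how much extra data HDFS_BYTES_READ reports beyond the input size.
MB = 1024 ** 2
GB = 1024 ** 3

dataset1 = 2**17 * 1 * MB    # 131072 key,value pairs, 1 MB per value
dataset2 = 2**12 * 32 * MB   # 4096 key,value pairs, 32 MB per value

assert dataset1 == dataset2 == 128 * GB  # both are exactly 128 GB

# Reported HDFS_BYTES_READ counter values, in GB
read1 = 128.77
read2 = 152.00

print(read1 - 128.0)  # extra bytes read for dataset1: ~0.77 GB
print(read2 - 128.0)  # extra bytes read for dataset2: 24 GB
```

So the overhead grows from well under 1% with 1MB values to nearly 19% with 32MB values, even though the total input size is identical.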

I have also tested other value sizes.  There doesn't seem to be
any difference when the size of each value is small (<=128KB),
but a noticeable difference appears as the size increases.

Could you give me some idea of why this is happening?

By the way, I'm running Hadoop 0.20.2-cdh3u5.

Cheers

Jeff
