hadoop-mapreduce-user mailing list archives

From Kim Chew <kchew...@gmail.com>
Subject Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?
Date Thu, 27 Mar 2014 18:43:48 GMT
I have a simple map-only M/R job, so there is no reducer. The mapper reads
a timestamp from the value, generates a path for the output file, and writes
the key and value to that file.

The input is an uncompressed sequence file stored in HDFS, 162.68 MB in size.

The output is also written as a sequence file.

However, after running the job I have two output part files from the mapper:
one is 835.12 MB and the other is 224.77 MB. Why is the total output size so
much larger? Shouldn't it be more or less equal to the input's 162.68 MB,
since I just write the key and value passed to the mapper straight to the
output?

Here is the mapper code snippet,

public void map(BytesWritable key, BytesWritable value, Context context)
        throws IOException, InterruptedException {

    // The timestamp is stored inside the value as epoch seconds.
    long timestamp = bytesToInt(value.getBytes(), TIMESTAMP_INDEX);
    String tsStr = sdf.format(new Date(timestamp * 1000L));

    mos.write(key, value, generateFileName(tsStr)); // mos is a MultipleOutputs object.
}

private String generateFileName(String key) {
    return outputDir + "/" + key + "/raw-vectors";
}
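
The bytesToInt helper isn't shown above; for reference, it does something like the following big-endian read (the 4-byte width, the offset, and the byte order here are illustrative assumptions, not the exact implementation):

```java
import java.nio.ByteBuffer;

public class BytesToIntSketch {

    // Hypothetical reconstruction of bytesToInt: read 4 bytes starting at
    // index as a big-endian signed int, widened to long so the epoch-seconds
    // value can later be multiplied by 1000L without overflow.
    static long bytesToInt(byte[] bytes, int index) {
        return ByteBuffer.wrap(bytes, index, 4).getInt();
    }

    public static void main(String[] args) {
        // Example value payload with the timestamp at offset 4
        // (0x53345F04 == 1395941124 epoch seconds, late March 2014).
        byte[] value = {0, 0, 0, 0, (byte) 0x53, 0x34, 0x5F, 0x04};
        System.out.println(bytesToInt(value, 4)); // prints 1395941124
    }
}
```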

And here are the job outputs,

14/03/27 11:00:56 INFO mapred.JobClient:     Launched map tasks=2
14/03/27 11:00:56 INFO mapred.JobClient:     Data-local map tasks=2
14/03/27 11:00:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format Counters
14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Written=0
14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_READ=171086386
14/03/27 11:00:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=54272
14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1111374798
14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters
14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Read=170782415
14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
14/03/27 11:00:56 INFO mapred.JobClient:     Map input records=547
14/03/27 11:00:56 INFO mapred.JobClient:     Physical memory (bytes)
snapshot=166428672
14/03/27 11:00:56 INFO mapred.JobClient:     Spilled Records=0
14/03/27 11:00:56 INFO mapred.JobClient:     Total committed heap usage
(bytes)=38351872
14/03/27 11:00:56 INFO mapred.JobClient:     CPU time spent (ms)=20080
14/03/27 11:00:56 INFO mapred.JobClient:     Virtual memory (bytes)
snapshot=1240104960
14/03/27 11:00:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=286
14/03/27 11:00:56 INFO mapred.JobClient:     Map output records=0
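
For what it's worth, the part-file sizes do line up with the counters (assuming the file browser reports MB as MiB, i.e. 1024*1024 bytes):

```java
public class CounterCheck {
    public static void main(String[] args) {
        long hdfsRead = 171086386L;      // HDFS_BYTES_READ from the job output
        long hdfsWritten = 1111374798L;  // HDFS_BYTES_WRITTEN from the job output
        double mib = 1024.0 * 1024.0;    // assumption: sizes shown are MiB

        // The two part files sum to the written counter almost exactly.
        System.out.printf("written = %.2f MiB%n", hdfsWritten / mib);   // ~1059.89
        System.out.printf("parts   = %.2f MiB%n", 835.12 + 224.77);     // 1059.89
        // Output is roughly 6.5x the input.
        System.out.printf("ratio   = %.1fx%n", (double) hdfsWritten / hdfsRead);
    }
}
```

So the counters themselves are consistent; the question is where the extra ~6.5x of bytes comes from.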

TIA,

Kim
