hadoop-common-user mailing list archives

From Thomas Bentsen ...@bentzn.com>
Subject Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?
Date Thu, 27 Mar 2014 19:30:30 GMT
Have you checked the content of the files you write?
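
As a rough starting point for that check, the counters quoted below can be turned into per-record averages. The following is a minimal sketch, not part of the original job: the class name `CounterCheck` and its helper methods are made up for illustration, and it assumes each of the 547 input records is written exactly once, as the quoted mapper suggests.

```java
import java.util.Locale;

// Hypothetical helper: converts the quoted job counters into MiB and
// average bytes per record, so input and output can be compared.
final class CounterCheck {
    static final double BYTES_PER_MIB = 1024.0 * 1024.0;

    // Convert a byte count to mebibytes.
    static double toMiB(long bytes) {
        return bytes / BYTES_PER_MIB;
    }

    // Average bytes per record (integer division, remainder dropped).
    static long bytesPerRecord(long bytes, long records) {
        return bytes / records;
    }

    public static void main(String[] args) {
        long bytesRead = 170_782_415L;      // "Bytes Read" from File Input Format Counters
        long bytesWritten = 1_111_374_798L; // HDFS_BYTES_WRITTEN
        long records = 547L;                // Map input records

        System.out.printf(Locale.ROOT, "read:    %.2f MiB, %d bytes/record%n",
                toMiB(bytesRead), bytesPerRecord(bytesRead, records));
        System.out.printf(Locale.ROOT, "written: %.2f MiB, %d bytes/record%n",
                toMiB(bytesWritten), bytesPerRecord(bytesWritten, records));
        System.out.printf(Locale.ROOT, "ratio:   %.2f%n",
                (double) bytesWritten / bytesRead);
    }
}
```

If the average written record is several times larger than the average read record, the extra bytes are inside the records themselves rather than in file overhead, which is what inspecting the output files would confirm or rule out.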


/th

On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
> I have a simple M/R job that uses only a Mapper, with no reducer. The
> mapper reads a timestamp from the value, generates a path to the output
> file, and writes the key and value to that file.
> 
> 
> The input file is an uncompressed sequence file stored in HDFS; its
> size is 162.68 MB.
> 
> 
> The output is also written as a sequence file.
> 
> 
> 
> However, after I ran my job, I have two output part files from the
> mapper. One has a size of 835.12 MB and the other has a size of 224.77
> MB. So why is the total output size so much larger? Shouldn't it
> be more or less equal to the input size of 162.68 MB, since I just
> write the key and value passed to the mapper to the output?
> 
> 
> Here is the mapper code snippet,
> 
> public void map(BytesWritable key, BytesWritable value, Context context)
>         throws IOException, InterruptedException {
>     // Read the timestamp (in seconds) embedded in the value bytes.
>     long timestamp = bytesToInt(value.getBytes(), TIMESTAMP_INDEX);
>     String tsStr = sdf.format(new Date(timestamp * 1000L));
>
>     // mos is a MultipleOutputs object; the output path depends on the timestamp.
>     mos.write(key, value, generateFileName(tsStr));
> }
>
> private String generateFileName(String key) {
>     return outputDir + "/" + key + "/raw-vectors";
> }
> 
> 
> And here are the job outputs,
> 
> 14/03/27 11:00:56 INFO mapred.JobClient:     Launched map tasks=2
> 14/03/27 11:00:56 INFO mapred.JobClient:     Data-local map tasks=2
> 14/03/27 11:00:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
> 14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format Counters
> 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Written=0
> 14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
> 14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_READ=171086386
> 14/03/27 11:00:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=54272
> 14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1111374798
> 14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters
> 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Read=170782415
> 14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
> 14/03/27 11:00:56 INFO mapred.JobClient:     Map input records=547
> 14/03/27 11:00:56 INFO mapred.JobClient:     Physical memory (bytes) snapshot=166428672
> 14/03/27 11:00:56 INFO mapred.JobClient:     Spilled Records=0
> 14/03/27 11:00:56 INFO mapred.JobClient:     Total committed heap usage (bytes)=38351872
> 14/03/27 11:00:56 INFO mapred.JobClient:     CPU time spent (ms)=20080
> 14/03/27 11:00:56 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1240104960
> 14/03/27 11:00:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=286
> 14/03/27 11:00:56 INFO mapred.JobClient:     Map output records=0
> 
> 
> TIA,
> 
> 
> Kim
> 
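
The quoted mapper derives one output path per distinct formatted timestamp. The following is a minimal standalone sketch of that mapping, under assumptions the original message does not state: the timestamp is taken to be a 4-byte big-endian count of seconds since the epoch, the date pattern is `yyyy-MM-dd` in UTC, and the class name `OutputPathSketch` and the three-argument `generateFileName` are made up for illustration. The real `bytesToInt`, `sdf`, and `TIMESTAMP_INDEX` are not shown in the thread.

```java
import java.nio.ByteBuffer;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical stand-in for the helpers used by the quoted mapper.
final class OutputPathSketch {
    // Assumption: the timestamp is a 4-byte big-endian unsigned int at `offset`.
    static long bytesToInt(byte[] buf, int offset) {
        return ByteBuffer.wrap(buf, offset, 4).getInt() & 0xFFFFFFFFL;
    }

    // Builds outputDir/<date>/raw-vectors from the timestamp inside `value`.
    static String generateFileName(String outputDir, byte[] value, int tsIndex) {
        // Assumption: one directory per UTC calendar day.
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
        String tsStr = sdf.format(new Date(bytesToInt(value, tsIndex) * 1000L));
        return outputDir + "/" + tsStr + "/raw-vectors";
    }

    public static void main(String[] args) {
        // 4 bytes of padding, then 1395882000 (2014-03-27 01:00:00 UTC).
        byte[] value = ByteBuffer.allocate(8).putInt(0).putInt(1_395_882_000).array();
        System.out.println(generateFileName("/out", value, 4));
    }
}
```

Running a mapping like this over a sample of input values shows how many distinct output paths to expect, which can then be compared with the files actually present under the output directory.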


