hadoop-mapreduce-user mailing list archives

From Hardik Pandya <smarty.ju...@gmail.com>
Subject Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?
Date Fri, 28 Mar 2014 17:34:40 GMT
What is your compression format: gzip, LZO, or Snappy?

For an LZO-compressed final output:

FileOutputFormat.setCompressOutput(conf, true);
FileOutputFormat.setOutputCompressorClass(conf, LzoCodec.class);

In addition, to make the LZO output splittable, you need to create an LZO index file.
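The size blow-up in this thread is consistent with compressed input being rewritten uncompressed. A minimal, self-contained sketch of the effect, using plain `java.util.zip` GZIP as a stand-in for the SequenceFile codec (the class name and the sample record line are made up for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressionRatio {

    // Build n copies of a repetitive "record" line (hypothetical sample data).
    static byte[] sampleRecords(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("1395945600,raw-vector,0.0,0.0,0.0\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // GZIP-compress a byte array and return the compressed size in bytes.
    static int gzipSize(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = sampleRecords(10000);
        int compressed = gzipSize(raw);
        // Repetitive records compress very well; reading compressed input and
        // writing it back uncompressed inflates HDFS_BYTES_WRITTEN relative
        // to HDFS_BYTES_READ by roughly this ratio.
        System.out.println("raw=" + raw.length + " gzip=" + compressed);
    }
}
```

The same reasoning explains the counters below: 162.68 MB of compressed input expanding to ~1 GB of uncompressed output is an unremarkable compression ratio for structured data.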


On Thu, Mar 27, 2014 at 8:57 PM, Kim Chew <kchew534@gmail.com> wrote:

> Thanks folks.
>
> I was not aware that my input data file had been compressed.
> FileOutputFormat.setCompressOutput() was set to true when the file was
> written. 8-(
>
> Kim
>
>
> On Thu, Mar 27, 2014 at 5:46 PM, Mostafa Ead <mostafa.g.ead@gmail.com> wrote:
>
>> The following might answer you partially:
>>
>> The input key is not read from HDFS; it is auto-generated as the offset of
>> the input value in the input file. I think that is (partially) why HDFS
>> bytes read is smaller than HDFS bytes written.
>>  On Mar 27, 2014 1:34 PM, "Kim Chew" <kchew534@gmail.com> wrote:
>>
>>> I am also wondering: if, say, I have two identical timestamps, they are
>>> going to be written to the same file. Does MultipleOutputs handle appending?
>>>
>>> Thanks.
>>>
>>> Kim
>>>
>>>
>>> On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen <th@bentzn.com> wrote:
>>>
>>>> Have you checked the content of the files you write?
>>>>
>>>>
>>>> /th
>>>>
>>>> On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
>>>> > I have a simple M/R job using a Mapper only, thus no reducer. The mapper
>>>> > reads a timestamp from the value, generates a path to the output file,
>>>> > and writes the key and value to the output file.
>>>> >
>>>> >
>>>> > The input file is a sequence file, not compressed, stored in HDFS; it
>>>> > has a size of 162.68 MB.
>>>> >
>>>> >
>>>> > Output also is written as a sequence file.
>>>> >
>>>> >
>>>> >
>>>> > However, after I ran my job, I have two output part files from the
>>>> > mapper. One has a size of 835.12 MB and the other has a size of 224.77
>>>> > MB. So why is the total output size so much larger? Shouldn't it
>>>> > be more or less equal to the input's size of 162.68 MB, since I just
>>>> > write the key and value passed to the mapper to the output?
>>>> >
>>>> >
>>>> > Here is the mapper code snippet,
>>>> >
>>>> > public void map(BytesWritable key, BytesWritable value, Context
>>>> > context) throws IOException, InterruptedException {
>>>> >
>>>> >         long timestamp = bytesToInt(value.getBytes(),
>>>> > TIMESTAMP_INDEX);
>>>> >         String tsStr = sdf.format(new Date(timestamp * 1000L));
>>>> >
>>>> >         mos.write(key, value, generateFileName(tsStr)); // mos is a
>>>> > MultipleOutputs object.
>>>> >     }
>>>> >
>>>> >     private String generateFileName(String key) {
>>>> >         return outputDir + "/" + key + "/raw-vectors";
>>>> >     }
>>>> >
>>>> > And here are the job outputs,
>>>> >
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Launched map tasks=2
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Data-local map tasks=2
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format
>>>> > Counters
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Written=0
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_READ=171086386
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=54272
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:
>>>> > HDFS_BYTES_WRITTEN=1111374798
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Read=170782415
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map input records=547
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Physical memory (bytes)
>>>> > snapshot=166428672
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Spilled Records=0
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Total committed heap
>>>> > usage (bytes)=38351872
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     CPU time spent (ms)=20080
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Virtual memory (bytes)
>>>> > snapshot=1240104960
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=286
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map output records=0
>>>> >
>>>> >
>>>> > TIA,
>>>> >
>>>> >
>>>> > Kim
>>>> >
>>>>
>>>>
>>>>
>>>
>
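On the timestamp handling in the mapper quoted above: since two identical timestamps produce the same path (which is why the appending question matters), it can help to see the conversion in isolation. A minimal, self-contained sketch, with the timezone pinned to UTC for reproducibility and a hypothetical "output" base directory and "yyyy-MM-dd" pattern standing in for Kim's actual configuration:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimestampPath {

    // Turn epoch seconds into an output directory path, mirroring the
    // sdf.format(new Date(timestamp * 1000L)) call in the mapper. The base
    // directory "output" and the date pattern are assumptions for illustration.
    static String buildPath(long epochSeconds) {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        // Pin the timezone so the path is the same on every node.
        sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
        String tsStr = sdf.format(new Date(epochSeconds * 1000L));
        return "output" + "/" + tsStr + "/raw-vectors";
    }

    public static void main(String[] args) {
        // 1395945600 epoch seconds = 2014-03-27 18:40:00 UTC
        System.out.println(buildPath(1395945600L)); // → output/2014-03-27/raw-vectors
    }
}
```

Note that any two records from the same day collapse onto one path here, so MultipleOutputs writes them to the same named output within a single task attempt.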
