hadoop-common-user mailing list archives

From Rekha Joshi <rekha...@yahoo-inc.com>
Subject Re: compressed input splits to Map tasks
Date Thu, 15 Apr 2010 09:38:39 GMT
By default, with compressed files you lose the ability to control splits: each file is
read as a single split by a single mapper.
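The underlying reason is that a plain gzip/DEFLATE stream can only be decoded starting from byte 0, so a split that begins mid-file is undecodable on its own. A small Python sketch (standard library only, not Hadoop code) illustrates this:

```python
import gzip
import zlib

data = b"some line of text\n" * 5000
compressed = gzip.compress(data)

# Decoding from the start of the stream works fine.
assert gzip.decompress(compressed) == data

# Decoding from an arbitrary offset does not: DEFLATE output depends on
# everything that came before it, and a chunk taken from the middle of the
# file has no valid gzip header, so it cannot be decoded independently.
d = zlib.decompressobj(wbits=31)  # wbits=31 -> expect a gzip wrapper
try:
    d.decompress(compressed[len(compressed) // 2:])
    mid_stream_ok = True
except zlib.error:
    mid_stream_ok = False

print(mid_stream_ok)
```

This is exactly the situation a mapper handed a mid-file split would face, which is why Hadoop falls back to one split per gzip file.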

There has been some discussion around this for bzip2 and gzip, and fixes have been made
to allow bzip2 to be splittable. Refer to HADOOP-4012.
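HADOOP-4012 builds on the fact that bzip2 data is organized into blocks delimited by a recognizable magic number, so a reader can seek into the middle of a file and resynchronize at the next block boundary. As a loose analogy in Python (using the standard bz2 module, not Hadoop's reader), independently compressed bzip2 streams stay individually decodable even when concatenated:

```python
import bz2

# Two independently compressed chunks...
part1 = bz2.compress(b"first chunk\n")
part2 = bz2.compress(b"second chunk\n")

# ...each decode on their own, without seeing the other one.
assert bz2.decompress(part1) == b"first chunk\n"
assert bz2.decompress(part2) == b"second chunk\n"

# A reader that can find a stream boundary can also decode the
# concatenation as a whole -- the property a splittable reader needs.
combined = part1 + part2
assert bz2.decompress(combined) == b"first chunk\nsecond chunk\n"
print(len(combined))
```

Hadoop's splittable bzip2 reader works at the block level within a single stream rather than on concatenated streams, but the principle is the same: each unit can be decoded without the bytes that precede it.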

Also, Kevin Weil's hadoop-lzo provides LZO compression and an LzoTextInputFormat which
overcomes this disadvantage and is faster; note that LZO files must first be indexed
(with the bundled LzoIndexer) before they become splittable.
Refer http://github.com/kevinweil/hadoop-lzo


On 4/15/10 6:56 AM, "abhishek sharma" <absharma@usc.edu> wrote:

Hi all,

I created some data using the randomwriter utility and compressed the
map task outputs using the options
-D mapred.output.compress=true
-D mapred.map.output.compression.type=BLOCK

I set the bytes per map to be 128 MB, but due to compression the final
size of each map task's output is around 75 MB.

I want to use these individual 75MB compressed files as input to
another Map task.
How do I get Hadoop to first decompress the files before computing the
input splits for the map tasks?

