hadoop-mapreduce-user mailing list archives

From Devaraj k <devara...@huawei.com>
Subject RE: CompressionCodec in MapReduce
Date Wed, 11 Apr 2012 08:37:11 GMT
Hi Grzegorz,

    You can use the following properties for job input and output compression.

The property below is used by the codec factory. The codec is chosen based on the type of the file (i.e. its suffix). By default the LineRecordReader used by FileInputFormat relies on this. If you want input compression handled some other way, you can write an input format that does so.

core-site.xml:
---------------

<property> 
  <name>io.compression.codecs</name> 
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec</value>

  <description>A list of the compression codec classes that can be used 
               for compression/decompression.</description> 
</property> 
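The suffix-based lookup the codec factory performs can be sketched in plain Java. This is only an illustration of the mechanism, not the real CompressionCodecFactory API; the suffix-to-class table below is an assumption based on each codec's default extension:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CodecBySuffix {
    // Suffix -> codec class name, mirroring what the factory builds
    // from the io.compression.codecs list (illustrative values).
    static final Map<String, String> SUFFIX_TO_CODEC = new LinkedHashMap<>();
    static {
        SUFFIX_TO_CODEC.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        SUFFIX_TO_CODEC.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        SUFFIX_TO_CODEC.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        SUFFIX_TO_CODEC.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
        SUFFIX_TO_CODEC.put(".lz4", "org.apache.hadoop.io.compress.Lz4Codec");
    }

    // Returns the codec class name for a path, or null when no suffix
    // matches (the file is then read uncompressed).
    static String codecFor(String path) {
        for (Map.Entry<String, String> e : SUFFIX_TO_CODEC.entrySet()) {
            if (path.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }
}
```

So an input file named part-00000.gz would be routed to the gzip codec, while a plain .txt file is read as-is.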


   I am not sure which version of Hadoop you are using, so I am giving the properties for both newer and older versions. These are the properties you need to configure if you want to compress job outputs. They work only when the output format is a FileOutputFormat.

mapred-site.xml: (for version 0.23 and later)
---------------------------------------------------

<property> 
  <name>mapreduce.output.fileoutputformat.compress</name> 
  <value>false</value> 
  <description>Should the job outputs be compressed? 
  </description> 
</property> 

<property> 
  <name>mapreduce.output.fileoutputformat.compression.type</name> 
  <value>RECORD</value> 
  <description>If the job outputs are to be compressed as SequenceFiles, how should 
               they be compressed? Should be one of NONE, RECORD or BLOCK. 
  </description> 
</property> 

<property> 
  <name>mapreduce.output.fileoutputformat.compression.codec</name> 
  <value>org.apache.hadoop.io.compress.DefaultCodec</value> 
  <description>If the job outputs are compressed, how should they be compressed? 
  </description> 
</property> 
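These same settings can also be made per job from the driver instead of mapred-site.xml. A sketch using the org.apache.hadoop.mapreduce API, assuming a 0.23-or-later Hadoop on the classpath (this is a job-configuration fragment, so the remaining job setup is elided):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressedOutputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "compressed-output");

        // Equivalent of mapreduce.output.fileoutputformat.compress=true
        FileOutputFormat.setCompressOutput(job, true);

        // Equivalent of mapreduce.output.fileoutputformat.compression.codec
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // Equivalent of mapreduce.output.fileoutputformat.compression.type;
        // only meaningful when the output is a SequenceFile (NONE, RECORD, BLOCK).
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

        // ... set input/output paths, mapper, reducer, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```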




mapred-site.xml:(for older versions)
------------------------------------------

<property> 
  <name>mapred.output.compress</name> 
  <value>false</value> 
  <description>Should the job outputs be compressed? 
  </description> 
</property> 

<property> 
  <name>mapred.output.compression.type</name> 
  <value>RECORD</value> 
  <description>If the job outputs are to be compressed as SequenceFiles, how should 
               they be compressed? Should be one of NONE, RECORD or BLOCK. 
  </description> 
</property> 

<property> 
  <name>mapred.output.compression.codec</name> 
  <value>org.apache.hadoop.io.compress.DefaultCodec</value> 
  <description>If the job outputs are compressed, how should they be compressed? 
  </description> 
</property> 


If you want to use compression with your custom input and output formats, you can implement the compression in those classes.
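The pattern such a custom format would use is to wrap the raw file streams. A minimal sketch, with java.util.zip's gzip streams standing in for a Hadoop codec's createOutputStream/createInputStream (the class and method names here are illustrative, not Hadoop API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class StreamWrapping {
    // Compress by wrapping the raw output stream, as a custom
    // output format would with codec.createOutputStream(raw).
    static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(raw)) {
            out.write(data);
        }
        return raw.toByteArray();
    }

    // Decompress by wrapping the raw input stream, as a custom
    // record reader would with codec.createInputStream(raw).
    static byte[] decompress(byte[] data) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```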


Thanks
Devaraj
________________________________________
From: Grzegorz Gunia [sawtyss@student.agh.edu.pl]
Sent: Wednesday, April 11, 2012 1:46 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: CompressionCodec in MapReduce

Thanks for your reply! That clears some things up.
There is but one problem... My CompressionCodec has to be instantiated on a per-file basis,
meaning it needs to know the name of the file it is to compress/decompress. I'm guessing that
would not be possible with the current implementation?

Or if so, how would I proceed with injecting it with the file name?
--
Greg

On 2012-04-11 10:12, Zizon Qiu wrote:
Append your custom codec's full class name to "io.compression.codecs", either in mapred-site.xml or in the Configuration object passed to the Job constructor.

The MapReduce framework will try to guess the compression algorithm from the input file's suffix.

If the getDefaultExtension() of any CompressionCodec registered in the configuration matches the suffix, Hadoop will try to instantiate that codec and, if it succeeds, decompress the input for you automatically.

the default value for "io.compression.codecs" is "org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec"

On Wed, Apr 11, 2012 at 3:55 PM, Grzegorz Gunia <sawtyss@student.agh.edu.pl<mailto:sawtyss@student.agh.edu.pl>>
wrote:
Hello,
I am trying to apply a custom CompressionCodec to MapReduce jobs, but I haven't found a way to inject it during the reading of input data or during the writing of the job results.
Am I missing something, or is there no support for compressed files in the filesystem?

I am well aware of how to set it up for the intermediate phases of the MapReduce operation, but I just can't find a way to apply it BEFORE the job takes place...
Is there any other way except simply uncompressing the files I need prior to scheduling a job?

Huge thanks for any help you can give me!
--
Greg


