hadoop-mapreduce-user mailing list archives

From Arun C Murthy <...@hortonworks.com>
Subject Re: CompressionCodec in MapReduce
Date Wed, 11 Apr 2012 18:40:06 GMT
You can write your own InputFormat (IF) which extends FileInputFormat.

In your IF, the InputSplit passed to getRecordReader carries the filename (for file-based input it is a FileSplit, so getPath() returns it).
That is the hook you are looking for.
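
As a rough illustration of what the record reader could do once it has the filename, here is a Hadoop-free sketch that looks up a per-file key and wraps the raw stream. Everything in it is an assumption made for the example: the class name PerFileCrypto, the in-memory key map, and the choice of AES/CTR with the IV stored as the first 16 bytes of the file. In a real InputFormat the name would come from ((FileSplit) split).getPath() and the raw stream from FileSystem.open(path).

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical helper, not part of Hadoop: per-file key lookup plus stream wrapping. */
final class PerFileCrypto {
    private static final int IV_LENGTH = 16;

    // Hypothetical key store: file name -> 16-byte AES key.
    // A real deployment would fetch keys from a secured service instead.
    private static final Map<String, byte[]> KEYS = new ConcurrentHashMap<>();

    private PerFileCrypto() {}

    static void registerKey(String fileName, byte[] key) {
        KEYS.put(fileName, key.clone());
    }

    private static SecretKeySpec keyFor(String fileName) throws IOException {
        byte[] key = KEYS.get(fileName);
        if (key == null) {
            throw new IOException("No key registered for " + fileName);
        }
        return new SecretKeySpec(key, "AES");
    }

    /** Writes a fresh random IV to the raw stream, then returns an encrypting wrapper. */
    static OutputStream encrypting(String fileName, OutputStream raw)
            throws IOException, GeneralSecurityException {
        byte[] iv = new byte[IV_LENGTH];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, keyFor(fileName), new IvParameterSpec(iv));
        raw.write(iv);
        return new CipherOutputStream(raw, cipher);
    }

    /** Reads the IV prefix from the raw stream, then returns a decrypting wrapper. */
    static InputStream decrypting(String fileName, InputStream raw)
            throws IOException, GeneralSecurityException {
        byte[] iv = new byte[IV_LENGTH];
        // DataInputStream does not buffer, so exactly IV_LENGTH bytes are consumed.
        new DataInputStream(raw).readFully(iv);
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, keyFor(fileName), new IvParameterSpec(iv));
        return new CipherInputStream(raw, cipher);
    }
}
```

The stream returned by decrypting(...) is what you would hand to your line-reading logic. Because the IV sits at the start of the file in this sketch, such files have to be read from the beginning, so the IF should report them as not splittable.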

More details here:


On Apr 11, 2012, at 2:53 PM, Grzegorz Gunia wrote:

> I think there was a misunderstanding here.
> I'll base my question on an example:
> Let's say I want each of the files stored on my HDFS to be encrypted before being physically stored on the cluster.
> For that I'll write a custom CompressionCodec that performs the encryption, and use it during any edits/creations of files in HDFS.
> Then, to make it more secure, I'll have it use different keys for different files, and supply the keys to the codec during its instantiation.
> Now I'd like to run a MapReduce job on those files. That would require instantiating the codec and supplying it with the filename to determine the key used. Is it possible to do so with the current implementation of Hadoop?
> --
> Greg
> On 2012-04-11 10:44, Zizon Qiu wrote:
>> If:
>> 1. you are using TextInputFormat,
>> 2. all input files end with a certain suffix such as ".gz", and
>> 3. the custom CompressionCodec is already registered in the configuration and its getDefaultExtension returns the same suffix as described in 2,
>> then there is nothing else you need to do.
>> Hadoop will deal with it automatically.
>> That means the input key and value in the map method are already decompressed.
>> But if the original files do not end with a known suffix, you need to write your own InputFormat or subclass TextInputFormat and override the createRecordReader method to return your own RecordReader.
>> The InputSplit passed to the InputFormat is actually a FileSplit, from which you can retrieve the input file path.
>> You may also want to look at the isSplitable method declared in FileInputFormat, if your files are not splittable.
>> For more detail, refer to the TextInputFormat class implementation.
>> On Wed, Apr 11, 2012 at 4:16 PM, Grzegorz Gunia <sawtyss@student.agh.edu.pl> wrote:
>> Thanks for your reply! That clears some things up.
>> There is but one problem... My CompressionCodec has to be instantiated on a per-file basis, meaning it needs to know the name of the file it is to compress/decompress. I'm guessing that would not be possible with the current implementation?
>> Or if it is, how would I go about injecting the file name into it?
>> --
>> Greg
>> On 2012-04-11 10:12, Zizon Qiu wrote:
>>> Append your custom codec's fully qualified class name to "io.compression.codecs", either in mapred-site.xml or in the Configuration object passed to the Job constructor.
>>> The MapReduce framework will try to guess the compression algorithm from the input file's suffix.
>>> If the getDefaultExtension() of any CompressionCodec registered in the configuration matches the suffix, Hadoop will try to instantiate that codec and, if that succeeds, decompress the input for you automatically.
>>> The default value for "io.compression.codecs" is "org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec"
>>> On Wed, Apr 11, 2012 at 3:55 PM, Grzegorz Gunia <sawtyss@student.agh.edu.pl> wrote:
>>> Hello,
>>> I am trying to use a custom CompressionCodec with MapReduce jobs, but I haven't found a way to inject it during the reading of input data or during the writing of the job results.
>>> Am I missing something, or is there no support for compressed files in the filesystem?
>>> I am well aware of how to set it up to work during the intermediate phases of the MapReduce operation, but I just can't find a way to apply it BEFORE the job takes place...
>>> Is there any other way besides simply uncompressing the files I need prior to scheduling a job?
>>> Huge thanks for any help you can give me!
>>> --
>>> Greg

Arun C. Murthy
Hortonworks Inc.
