hadoop-mapreduce-user mailing list archives

From Zizon Qiu <zzd...@gmail.com>
Subject Re: CompressionCodec in MapReduce
Date Wed, 11 Apr 2012 14:05:18 GMT
It is possible but a little tricky.

As I mentioned before, write a custom InputFormat and the associate
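
A minimal stand-alone sketch of the per-file idea, using only the JDK so it runs without a cluster. It is a model, not Hadoop code: the class name PerFileKeyReader, the in-memory key map, and the choice of AES/CTR with an all-zero IV are assumptions made for illustration, and the fixed IV is not safe for real use. In an actual job this logic would presumably live in the RecordReader returned by the custom InputFormat's createRecordReader, with the path taken from the FileSplit.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class PerFileKeyReader {
    // Hypothetical key store: file path -> 16-byte AES key.
    private final Map<String, byte[]> keysByPath;

    public PerFileKeyReader(Map<String, byte[]> keysByPath) {
        this.keysByPath = keysByPath;
    }

    // Builds an AES/CTR cipher for the given mode and key.
    // The all-zero IV is an illustration-only assumption and is NOT secure.
    public static Cipher cipher(int mode, byte[] key) throws GeneralSecurityException {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(new byte[16]));
        return c;
    }

    // Picks the key by file path and wraps the raw stream in a decrypting stream.
    public InputStream open(String path, InputStream raw) throws GeneralSecurityException {
        byte[] key = keysByPath.get(path);
        if (key == null) {
            throw new IllegalArgumentException("no key registered for " + path);
        }
        return new CipherInputStream(raw, cipher(Cipher.DECRYPT_MODE, key));
    }

    // Reads the decrypted content line by line, as a line-oriented record reader would.
    public List<String> readLines(String path, InputStream raw)
            throws IOException, GeneralSecurityException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(open(path, raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

The point of the sketch is only the order of operations: the file path is known before the stream is opened, so the key can be chosen per file before any record is read.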

On Wed, Apr 11, 2012 at 5:23 PM, Grzegorz Gunia

> I think we misunderstood each other here.
> I'll base my question on an example:
> Let's say I want each of the files stored on my HDFS to be encrypted prior
> to being physically stored on the cluster.
> For that I'll write a custom CompressionCodec that performs the
> encryption, and use it during any edits/creations of files in the HDFS.
> Then, to make it more secure, I'll make it use different keys for
> different files, and supply the keys to the codec during its instantiation.
> Now I'd like to run a MapReduce job on those files. That would require
> instantiating the codec, and supplying it with the filename, to determine
> the key used. Is it possible to do so with the current implementation of
> Hadoop?
> --
> Greg
> On 2012-04-11 10:44, Zizon Qiu wrote:
> If:
> 1. you are using TextInputFormat,
> 2. all input files end with a certain suffix such as ".gz", and
> 3. the custom CompressionCodec is already registered in the configuration and
> its getDefaultExtension returns the same suffix as described in 2,
> then there is nothing else you need to do.
> Hadoop will deal with it automatically.
> That means the input keys and values in the map method are already decompressed.
> But if the original files do not end with a certain suffix, you need
> to write your own InputFormat or subclass TextInputFormat and override the
> createRecordReader method to return your own RecordReader.
> The InputSplit passed to createRecordReader is actually a FileSplit, from which
> you can retrieve the input file path.
> You may also take a look at the isSplitable method declared
> in FileInputFormat, if your files are not splittable.
> For more detail, refer to the TextInputFormat class implementation.
> On Wed, Apr 11, 2012 at 4:16 PM, Grzegorz Gunia <
> sawtyss@student.agh.edu.pl> wrote:
>> Thanks for your reply! That clears some things up.
>> There is but one problem... My CompressionCodec has to be instantiated on
>> a per-file basis, meaning it needs to know the name of the file it is to
>> compress/decompress. I'm guessing that would not be possible with the
>> current implementation?
>> Or if so, how would I proceed with injecting it with the file name?
>> --
>> Greg
>> On 2012-04-11 10:12, Zizon Qiu wrote:
>> Append your custom codec's full class name to "io.compression.codecs",
>> either in mapred-site.xml or in the Configuration object passed to the Job
>> constructor.
>> The MapReduce framework will try to guess the compression algorithm from
>> the input file's suffix.
>> If the getDefaultExtension() of any CompressionCodec registered in the
>> configuration matches the suffix, Hadoop will try to instantiate the codec and,
>> if that succeeds, decompress the input for you automatically.
>> The default value for "io.compression.codecs" is
>> "org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec"
>> On Wed, Apr 11, 2012 at 3:55 PM, Grzegorz Gunia <
>> sawtyss@student.agh.edu.pl> wrote:
>>> Hello,
>>> I am trying to apply a custom CompressionCodec to work with MapReduce
>>> jobs, but I haven't found a way to inject it during the reading of input
>>> data, or during the write of the job results.
>>> Am I missing something, or is there no support for compressed files in
>>> the filesystem?
>>> I am well aware of how to set it up to work during the intermediate
>>> phases of the MapReduce operation, but I just can't find a way to apply it
>>> BEFORE the job takes place...
>>> Is there any other way except simply uncompressing the files I need
>>> prior to scheduling a job?
>>> Huge thanks for any help you can give me!
>>> --
>>> Greg
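
The suffix-based codec selection described earlier in this thread can be modelled in a few lines of plain Java. This is a simplified stand-alone model, not Hadoop's implementation: the class SuffixCodecLookup and the codec name com.example.EncryptingCodec with its ".enc" extension are hypothetical, while ".gz" is the default extension GzipCodec reports.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class SuffixCodecLookup {
    // Maps a codec's default extension (e.g. ".gz") to its class name.
    private final Map<String, String> codecByExtension = new LinkedHashMap<>();

    // Registers a codec under the extension it would report as its default.
    public void register(String defaultExtension, String codecClassName) {
        codecByExtension.put(defaultExtension, codecClassName);
    }

    // Returns the first registered codec whose extension matches the file name,
    // or an empty Optional when the file should be read as-is.
    public Optional<String> codecFor(String fileName) {
        for (Map.Entry<String, String> entry : codecByExtension.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }
}
```

A file whose name matches no registered extension gets no codec in this model, which corresponds to the case where a custom InputFormat is needed instead.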
