From Leonardo Urbina <lurb...@mit.edu>
Subject Processing compressed files in Hadoop
Date Wed, 08 Feb 2012 17:39:54 GMT
Hello everyone,

I run a daily job that takes files in a variety of formats and processes
them using several custom InputFormats, which are specified via
MultipleInputs. The results get aggregated into a single SequenceFile,
which is in turn used as part of the input for the next day's job. I run
all of this on Amazon's EMR. Now I would like to use compression to save
on storage; however, after looking around online I have hit some dead
ends:

1) I would like to compress my input files, and Hadoop gives me three
choices: gzip, bzip2 and LZO. I want to steer away from gzip and bzip2 as
they cannot be made splittable. LZO, on the other hand, can be indexed;
however, as far as I can tell, I would be forced to use LzoTextInputFormat
in order to get Hadoop to properly decompress and read the files. Most of
my input cannot use TextInputFormat (my inputs include multi-line records
and XML files, among other things). My question is: is it possible to use
LZO with custom InputFormats?
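
To make the splittability concern concrete, here is a minimal sketch that
uses only java.util.zip and no Hadoop classes. The names GzipSplitDemo,
gzip and readableFrom are made up for this example. It shows that a gzip
stream decodes from its first byte but not from an arbitrary offset, which
is the position a mapper assigned to a later split would be in.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipSplitDemo {
    /** Gzip-compresses the given bytes in memory. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(raw);
        }
        return buf.toByteArray();
    }

    /** Returns true if the bytes from offset onward decode as a gzip stream. */
    static boolean readableFrom(byte[] gz, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gz, offset, gz.length - offset))) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // Drain the stream; only success or failure matters here.
            }
            return true;
        } catch (IOException e) {
            // No gzip header at this offset, or the data is not decodable.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("record ").append(i).append('\n');
        }
        byte[] gz = gzip(sb.toString().getBytes(StandardCharsets.UTF_8));
        System.out.println("from start:  " + readableFrom(gz, 0));
        System.out.println("from middle: " + readableFrom(gz, gz.length / 2));
    }
}
```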

2) I am also interested in compressing the output SequenceFile. I know
compression can be enabled by calling

FileOutputFormat.setCompressOutput(conf, true)

If I were using TextOutputFormat, the output would be a gzipped text file.
However, since the output is a SequenceFile, it appears to be compressed
internally, and the compression scheme is not immediately apparent to me.
Is it possible to specify LZO as the compression? Also, since I will be
using the output as part of the next input, do I need to index the output
as a separate task? And finally, when I specify the input format for the
next day (and this goes back to my first question), what InputFormat
should I specify? I haven't been able to find something like
LzoSequenceInputFormat or anything similar.
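
For reference, this is roughly the output configuration I have in mind. It
is only a sketch under several assumptions: the old org.apache.hadoop.mapred
API with a JobConf, the hadoop-lzo library installed on the cluster so that
com.hadoop.compression.lzo.LzoCodec can be loaded, and block-level
compression being the right choice. The class name CompressedSeqFileOutput
is made up for this example, and whether this is the correct approach is
part of what I am asking.

```java
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

class CompressedSeqFileOutput {
    /** Configures the job to write an LZO-compressed SequenceFile. */
    static void configure(JobConf conf) throws ClassNotFoundException {
        // Assumption: hadoop-lzo is on the classpath. The codec is loaded by
        // name so this class compiles without that library.
        Class<? extends CompressionCodec> codec = conf
                .getClassByName("com.hadoop.compression.lzo.LzoCodec")
                .asSubclass(CompressionCodec.class);

        conf.setOutputFormat(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, codec);
        // Assumption: compress blocks of records rather than each record.
        SequenceFileOutputFormat.setOutputCompressionType(
                conf, SequenceFile.CompressionType.BLOCK);
    }
}
```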

Am I missing something? Any help would be greatly appreciated. Best,

Leo Urbina
Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
Department of Mathematics

