hadoop-common-user mailing list archives

From Bejoy Ks <bejoy.had...@gmail.com>
Subject Re: Processing compressed files in Hadoop
Date Wed, 08 Feb 2012 18:33:12 GMT
Hi Leo,
       You can index the LZO files as follows:

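// Run the LZO indexer on files in HDFS.
// LzoIndexer is com.hadoop.compression.lzo.LzoIndexer (hadoop-lzo); fs is an already-open FileSystem.
LzoIndexer indexer = new LzoIndexer(fs.getConf());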


On Wed, Feb 8, 2012 at 11:26 PM, Tim Broberg <Tim.Broberg@exar.com> wrote:

> Leo, splittable bzip is available
>  ...in versions > 0.21 - https://issues.apache.org/jira/browse/HADOOP-4012
>  ...or as a patch for 1.0.0, to be included in 1.1.0 -
> https://issues.apache.org/jira/browse/HADOOP-7823
> There is a 48-bit signature in the bzip header, and they search for this
> at all bit alignments.
> It's not fast, but it's there.
>    - Tim.
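
The search Tim describes (a 48-bit signature matched at every bit alignment) can be sketched roughly as below. This is only an illustration and not the code from HADOOP-4012. The class name BzipBlockScan and both method names are hypothetical. The sketch assumes the signature is the bzip2 block-header magic 0x314159265359 and that bits are packed most-significant-bit first, as bzip2 does.

```java
import java.util.ArrayList;
import java.util.List;

public class BzipBlockScan {
    // bzip2 block-header magic (48 bits).
    public static final long BLOCK_MAGIC = 0x314159265359L;
    private static final long MASK_48 = (1L << 48) - 1;

    /** Returns every bit offset in data at which the 48-bit magic begins. */
    public static List<Long> findBlockStarts(byte[] data) {
        List<Long> starts = new ArrayList<>();
        long window = 0;   // the most recent 48 bits read
        long bitsSeen = 0; // total bits consumed so far
        for (byte b : data) {
            for (int bit = 7; bit >= 0; bit--) { // MSB first
                window = ((window << 1) | ((b >> bit) & 1)) & MASK_48;
                bitsSeen++;
                if (bitsSeen >= 48 && window == BLOCK_MAGIC) {
                    starts.add(bitsSeen - 48);
                }
            }
        }
        return starts;
    }

    /** Test helper (hypothetical): a zero-filled buffer with the magic written at bitOffset. */
    public static byte[] embedMagic(int bitOffset, int totalBytes) {
        byte[] out = new byte[totalBytes];
        for (int i = 0; i < 48; i++) {
            long bit = (BLOCK_MAGIC >> (47 - i)) & 1;
            int pos = bitOffset + i;
            if (bit == 1) {
                out[pos / 8] |= (byte) (1 << (7 - pos % 8));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Magic placed 3 bits into the buffer, i.e. not byte-aligned.
        System.out.println(findBlockStarts(embedMagic(3, 10))); // prints [3]
    }
}
```

Checking every bit position one at a time is what makes this approach slow, but it is also what lets a reader that starts in the middle of a file find the next block boundary.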
> ________________________________________
> From: flechadeorion@gmail.com [flechadeorion@gmail.com] On Behalf Of
> Leonardo Urbina [lurbina@mit.edu]
> Sent: Wednesday, February 08, 2012 9:39 AM
> To: common-user@hadoop.apache.org
> Subject: Processing compressed files in Hadoop
> Hello everyone,
> I run a daily job that takes files in a variety of different formats and
> processes them using several custom InputFormats which are specified using
> MultipleInputs. The results get aggregated into a single SequenceFile.
> Furthermore this SequenceFile is used as part of the input for the next
> day's job. I run all of this in Amazon's EMR. Now, I would like to be able
> to use compression in order to save on storage, however after looking
> around online I have hit some dead ends:
> 1) I would like to compress my input files, and Hadoop gives me three
> choices: gzip, bzip2 and LZO. I want to steer away from gzip and bzip2 as
> they cannot be made splittable. LZO on the other hand can be indexed,
> however as far as I could tell, I would be forced to use LzoTextInputFormat
> in order to get Hadoop to properly decompress and read the files. Most of
> my input cannot use TextInputFormat (my inputs include multi-line records,
> XML files, among other things). My question is, is it possible to use LZO
> with custom InputFormats?
> 2) I am also interested in compressing the output SequenceFile. I know this
> can be done by setting
> FileOutputFormat.setCompressOutput(conf, true)
> If I were using TextOutputFormat, the output would be a gzipped text file.
> However, being a SequenceFile it seems to be internally compressed and the
> compression scheme is not immediately apparent to me. Is it possible to
> specify LZO as the compression? Also, since I will be using the output as
> part of the next input, do I need to index the output as a separate task?
> And finally, when I specify the input format for the next day (and this
> goes back to my first question), what InputFormat should I specify? I
> haven't been able to find something like LzoSequenceInputFormat or anything
> of the like.
> Am I missing something? Any help would be greatly appreciated. Best,
> -Leo
> --
> Leo Urbina
> Massachusetts Institute of Technology
> Department of Electrical Engineering and Computer Science
> Department of Mathematics
> lurbina@mit.edu
