Mailing-List: contact common-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of bejoy.hadoop@gmail.com
 designates 209.85.210.48 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <FCA91A92EE52B041906A0358FC28FCC38F667B5E48@FRE1EXCH02.hq.exar.com>
References: 
 <CA+v5OK+UVxjPbZOhW6urZF8Z4J1i2uomM_VkZJMY8ogYbL5-HQ@mail.gmail.com>
	<FCA91A92EE52B041906A0358FC28FCC38F667B5E48@FRE1EXCH02.hq.exar.com>
Date: Thu, 9 Feb 2012 00:03:12 +0530
Message-ID: 
 <CACD21EO2mPFYGN7LZc044yK4P6rSnkFfSzCNo7Ur1dS7NqW+9g@mail.gmail.com>
Subject: Re: Processing compressed files in Hadoop
From: Bejoy Ks <bejoy.hadoop@gmail.com>
To: common-user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=047d7b2ed6e1c3acbf04b87820f8

--047d7b2ed6e1c3acbf04b87820f8
Content-Type: text/plain; charset=ISO-8859-1

Hi Leo
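       You can index the LZO files as follows:

// Run the LZO indexer on files in HDFS.
// LzoIndexer is com.hadoop.compression.lzo.LzoIndexer (hadoop-lzo);
// fs is your FileSystem handle and filePath is the Path of the .lzo file.
LzoIndexer indexer = new LzoIndexer(fs.getConf());
indexer.index(filePath);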

Regards
Bejoy.K.S

On Wed, Feb 8, 2012 at 11:26 PM, Tim Broberg <Tim.Broberg@exar.com> wrote:

> Leo, splittable bzip2 is available
>  ...in versions 0.21 and later - https://issues.apache.org/jira/browse/HADOOP-4012
>  ...or as a patch for 1.0.0, to be included in 1.1.0 -
> https://issues.apache.org/jira/browse/HADOOP-7823
>
> There is a 48-bit signature in the bzip header, and they search for this
> at all bit alignments.
>
> It's not fast, but it's there.
>
>    - Tim.
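
To illustrate the bit-aligned signature search Tim describes above, here is a minimal, self-contained sketch in plain Java. It is not the Hadoop implementation from HADOOP-4012; the class name BlockMagicScanner and the method findBlockStarts are made up for this example. It assumes the standard bzip2 block-header magic 0x314159265359 and MSB-first bit order within each byte, and it simply slides a 48-bit window over the data one bit at a time.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hadoop.
class BlockMagicScanner {
    // bzip2 block-header magic, 48 bits wide.
    static final long BLOCK_MAGIC = 0x314159265359L;
    static final long MASK_48 = (1L << 48) - 1;

    /** Returns the bit offset of every position where the 48-bit magic starts. */
    static List<Long> findBlockStarts(byte[] data) {
        List<Long> offsets = new ArrayList<>();
        long window = 0;   // the most recent 48 bits read
        long bitsSeen = 0; // total number of bits consumed so far
        for (byte b : data) {
            // Walk each byte from its most significant bit down.
            for (int bit = 7; bit >= 0; bit--) {
                window = ((window << 1) | ((b >> bit) & 1)) & MASK_48;
                bitsSeen++;
                if (bitsSeen >= 48 && window == BLOCK_MAGIC) {
                    offsets.add(bitsSeen - 48);
                }
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // 64-bit buffer: 3 zero bits, the 48-bit magic, then 13 zero bits,
        // so the signature is deliberately not byte-aligned.
        byte[] buf = ByteBuffer.allocate(8).putLong(BLOCK_MAGIC << 13).array();
        System.out.println(findBlockStarts(buf)); // prints [3]
    }
}
```

Because every bit position has to be tested, the scan touches each input bit once, which is consistent with Tim's remark that the approach works but is not fast.
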
>
> ________________________________________
> From: flechadeorion@gmail.com [flechadeorion@gmail.com] On Behalf Of
> Leonardo Urbina [lurbina@mit.edu]
> Sent: Wednesday, February 08, 2012 9:39 AM
> To: common-user@hadoop.apache.org
> Subject: Processing compressed files in Hadoop
>
> Hello everyone,
>
> I run a daily job that takes files in a variety of different formats and
> processes them using several custom InputFormats which are specified using
> MultipleInputs. The results get aggregated into a single SequenceFile.
> Furthermore this SequenceFile is used as part of the input for the next
> day's job. I run all of this in Amazon's EMR. Now, I would like to be able
> to use compression in order to save on storage, however after looking
> around online I have hit some dead ends:
>
> 1) I would like to compress my input files, and Hadoop gives me three
> choices: gzip, bzip2 and LZO. I want to steer away from gzip and bzip2 as
> they cannot be made splittable. LZO on the other hand can be indexed,
> however as far as I could tell, I would be forced to use LzoTextInputFormat
> in order to get Hadoop to properly decompress and read the files. Most of
> my input cannot use TextInputFormat (my inputs include multi-line records,
> XML files, among other things). My question is, is it possible to use LZO
> with custom InputFormats?
>
> 2) I am also interested in compressing the output SequenceFile. I know this
> can be done by setting
>
> FileOutputFormat.setCompressOutput(conf, true)
>
> If I were using TextOutputFormat, the output would be a gzipped text file.
> However, being a SequenceFile it seems to be internally compressed and the
> compression scheme is not immediately apparent to me. Is it possible to
> specify LZO as the compression? Also, since I will be using the output as
> part of the next input, do I need to index the output as a separate task?
> And finally, when I specify the input format for the next day (and this
> goes back to my first question), what InputFormat should I specify? I
> haven't been able to find something like LzoSequenceInputFormat or anything
> of the like.
>
> Am I missing something? Any help would be greatly appreciated. Best,
> -Leo
>
> --
> Leo Urbina
> Massachusetts Institute of Technology
> Department of Electrical Engineering and Computer Science
> Department of Mathematics
> lurbina@mit.edu
>

--047d7b2ed6e1c3acbf04b87820f8--