Date: Thu, 9 Feb 2012 00:03:12 +0530
Subject: Re: Processing compressed files in Hadoop
From: Bejoy Ks
To: common-user@hadoop.apache.org

Hi Leo

You can index the LZO files as follows:

    import com.hadoop.compression.lzo.LzoIndexer;  // from the hadoop-lzo library

    // Run the LZO indexer on files in HDFS.
    // fs is an open FileSystem; filePath is the Path of the .lzo file.
    LzoIndexer indexer = new LzoIndexer(fs.getConf());
    indexer.index(filePath);

Regards
Bejoy.K.S

On Wed, Feb 8, 2012 at 11:26 PM, Tim Broberg wrote:

> Leo, splittable bzip is available
> ...in versions > 0.21 - https://issues.apache.org/jira/browse/HADOOP-4012
> ...or as a patch for 1.0.0, to be included in 1.1.0 -
> https://issues.apache.org/jira/browse/HADOOP-7823
>
> There is a 48-bit signature in the bzip header, and they search for this
> at all bit alignments.
>
> It's not fast, but it's there.
>
> - Tim.
>
> ________________________________________
> From: flechadeorion@gmail.com [flechadeorion@gmail.com] On Behalf Of
> Leonardo Urbina [lurbina@mit.edu]
> Sent: Wednesday, February 08, 2012 9:39 AM
> To: common-user@hadoop.apache.org
> Subject: Processing compressed files in Hadoop
>
> Hello everyone,
>
> I run a daily job that takes files in a variety of different formats and
> processes them using several custom InputFormats which are specified using
> MultipleInputs. The results get aggregated into a single SequenceFile.
> Furthermore this SequenceFile is used as part of the input for the next
> day's job. I run all of this in Amazon's EMR. Now, I would like to be able
> to use compression in order to save on storage, however after looking
> around online I have hit some dead ends:
>
> 1) I would like to compress my input files, and Hadoop gives me three
> choices: gzip, bzip2 and LZO.
> I want to steer away from gzip and bzip2 as
> they cannot be made splittable. LZO on the other hand can be indexed,
> however as far as I could tell, I would be forced to use LzoTextInputFormat
> in order to get Hadoop to properly decompress and read the files. Most of
> my input cannot use TextInputFormat (my inputs include multi-line records,
> XML files, among other things). My question is, is it possible to use LZO
> with custom InputFormats?
>
> 2) I am also interested in compressing the output SequenceFile. I know this
> can be done by setting
>
> FileOutputFormat.setCompressOutput(conf, true)
>
> If I were using TextOutputFormat, the output would be a gzipped text file.
> However, being a SequenceFile it seems to be internally compressed and the
> compression scheme is not immediately apparent to me. Is it possible to
> specify LZO as the compression? Also, since I will be using the output as
> part of the next input, do I need to index the output as a separate task?
> And finally, when I specify the input format for the next day (and this
> goes back to my first question), what InputFormat should I specify? I
> haven't been able to find something like LzoSequenceInputFormat or anything
> of the like.
>
> Am I missing something? Any help would be greatly appreciated. Best,
> -Leo
>
> --
> Leo Urbina
> Massachusetts Institute of Technology
> Department of Electrical Engineering and Computer Science
> Department of Mathematics
> lurbina@mit.edu
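
For reference, Tim's remark about the 48-bit signature being searched "at all bit alignments" can be sketched roughly as below. This is only an illustration of the idea, not Hadoop's actual splittable-bzip2 code. It assumes the bzip2 block-header magic 0x314159265359, and the class and method names (BzipBlockScan, findBlockStarts, bufferWithMagicAt) are made up for this example. The sample input is a synthetic buffer, not real compressed data.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: find candidate bzip2 block starts by scanning every bit offset. */
final class BzipBlockScan {
    // 48-bit bzip2 block-header magic (assumed constant for this sketch).
    static final long BLOCK_MAGIC = 0x314159265359L;
    static final long MASK_48 = (1L << 48) - 1;

    /** Returns the bit offsets (from the start of data) where the magic begins. */
    static List<Long> findBlockStarts(byte[] data) {
        List<Long> starts = new ArrayList<>();
        long window = 0;   // sliding window holding the last 48 bits read
        long bitsRead = 0;
        for (byte b : data) {
            for (int bit = 7; bit >= 0; bit--) {   // most-significant bit first
                window = ((window << 1) | ((b >> bit) & 1)) & MASK_48;
                bitsRead++;
                if (bitsRead >= 48 && window == BLOCK_MAGIC) {
                    starts.add(bitsRead - 48);
                }
            }
        }
        return starts;
    }

    /** Hypothetical helper: zeroed buffer with the magic written at bitOffset. */
    static byte[] bufferWithMagicAt(int sizeBytes, int bitOffset) {
        byte[] buf = new byte[sizeBytes];
        for (int i = 0; i < 48; i++) {
            long bit = (BLOCK_MAGIC >>> (47 - i)) & 1L;
            int pos = bitOffset + i;
            if (bit == 1L) {
                buf[pos / 8] |= (byte) (1 << (7 - (pos % 8)));
            }
        }
        return buf;
    }

    public static void main(String[] args) {
        // The magic is placed at bit 13, i.e. not on a byte boundary.
        byte[] sample = bufferWithMagicAt(16, 13);
        System.out.println(findBlockStarts(sample)); // prints [13]
    }
}
```

Because every bit position has to be tested, the scan touches each input bit once, which fits Tim's "not fast, but it's there". On real data the same 48-bit pattern could in principle also appear by chance inside a block, so a hit is best treated as a candidate split point.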