From: Rahul Bhattacharjee
Date: Wed, 12 Jun 2013 10:23:52 +0530
Subject: Re: Now give .gz file as input to the MAP
To: "user@hadoop.apache.org"

Nothing special is required to process .gz files using MR. However, as Sanjay mentioned, verify the codecs configured in core-site.xml. Another thing to note is that these files are not splittable, so each .gz file is read by a single mapper.

You might want to use bz2 instead; those files are splittable.
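
To illustrate the "not splittable" point: a gzip stream can only be decoded from its first byte, so a reader that starts in the middle of the file has nothing it can use. Below is a minimal sketch using only the JDK (no Hadoop classes); the class and method names are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of Hadoop.
class GzipSplitDemo {

    // Compress a string into an in-memory gzip stream.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Try to decompress starting at byte `offset`; returns true only if
    // the whole remainder decodes without an error.
    static boolean canDecodeFrom(byte[] gz, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gz, offset, gz.length - offset))) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // drain the stream; we only care whether decoding succeeds
            }
            return true;
        } catch (IOException e) {
            // Missing gzip header, corrupt data, or bad trailer.
            return false;
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("log line ").append(i).append('\n');
        }
        byte[] gz = gzip(sb.toString());
        System.out.println("from start:  " + canDecodeFrom(gz, 0));
        System.out.println("from middle: " + canDecodeFrom(gz, gz.length / 2));
    }
}
```

This is why a large .gz input ends up in one map task: there is no offset other than 0 at which a second reader could begin.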

Thanks,
Rahul


On Wed, Jun 12, 2013 at 10:14 AM, Sanjay Subramanian <Sanjay.Subramanian@wizecommerce.com> wrote:

hadoopConf.set("mapreduce.job.inputformat.class", "com.wizecommerce.utils.mapred.TextInputFormat");

hadoopConf.set("mapreduce.job.outputformat.class", "com.wizecommerce.utils.mapred.TextOutputFormat");

No special settings are required for reading Gzip except the ones above.
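
The reason no extra setting is needed is that the text input path picks a decompression codec from the file-name suffix. The following is only a rough JDK-only sketch of that idea, not Hadoop's actual code; `SuffixAwareLineReader`, `readLines` and `writeTemp` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of Hadoop.
class SuffixAwareLineReader {

    // Read all lines, decompressing transparently when the name ends in ".gz".
    static List<String> readLines(Path file) {
        boolean gz = file.getFileName().toString().endsWith(".gz");
        try (InputStream raw = Files.newInputStream(file);
             InputStream in = gz ? new GZIPInputStream(raw) : raw;
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            List<String> lines = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: write text to a temp file, gzip-compressed if the
    // suffix ends in ".gz", plain otherwise.
    static Path writeTemp(String suffix, String text) {
        try {
            Path file = Files.createTempFile("demo", suffix);
            file.toFile().deleteOnExit();
            try (OutputStream raw = Files.newOutputStream(file);
                 OutputStream out = suffix.endsWith(".gz")
                         ? new GZIPOutputStream(raw) : raw) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path gz = writeTemp(".gz", "first\nsecond\n");
        Path txt = writeTemp(".txt", "first\nsecond\n");
        // Both calls yield the same lines; only the suffix decides the decoding.
        System.out.println(readLines(gz));
        System.out.println(readLines(txt));
    }
}
```

The caller never says "this is gzip"; the file name does, which is the same reason the mapper code does not change when the input is compressed.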

If you want to output Gzip:

hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true");

hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.GzipCodec");

Make sure Gzip codec is defined in core-site.xml
<!-- core-site.xml -->
<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec</value>
</property>

I have a question

Why are you using GZIP as input to Map? These files are not splittable… unless you have to read multi-line records (like the lines between a BEGIN and END block in a log file) and send each block as one record to the mapper.
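
As a rough sketch of the multi-line case: when one reader sees every line of the file in order, a BEGIN/END block never straddles a split boundary, so grouping is a simple loop. The code below is plain Java, not a Hadoop RecordReader; the class name and the literal "BEGIN"/"END" line prefixes are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of Hadoop.
class BlockRecordGrouper {

    // Join each run of lines from a "BEGIN" line through the next "END"
    // line into one record. Lines outside a block are ignored, and a block
    // that never reaches an "END" line is dropped.
    static List<String> group(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null; // non-null while inside a block
        for (String line : lines) {
            if (line.startsWith("BEGIN")) {
                current = new StringBuilder(line);
            } else if (current != null) {
                current.append('\n').append(line);
                if (line.startsWith("END")) {
                    records.add(current.toString());
                    current = null;
                }
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "noise", "BEGIN job-1", "step a", "END",
                "BEGIN job-2", "END", "trailing noise");
        for (String record : group(log)) {
            System.out.println("--- record ---");
            System.out.println(record);
        }
    }
}
```

With a splittable input the same loop would need extra handling for a block that starts in one split and ends in the next.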

Also, among the non-splittable codecs, Snappy is a better choice.

Good Luck


sanjay

From: samir das mohapatra <samir.helpdoc@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Tuesday, June 11, 2013 9:07 PM
To: "cdh-user@cloudera.com" <cdh-user@cloudera.com>, "user@hadoop.apache.org" <user@hadoop.apache.org>, "user-help@hadoop.apache.org" <user-help@hadoop.apache.org>
Subject: Now give .gz file as input to the MAP

Hi All,
Has anyone worked on how to pass a .gz file as the input to a MapReduce job?

Regards,
samir.
