Subject: RE: Are .bz2 extensions supported in Hadoop 18.3
Date: Wed, 24 Jun 2009 11:41:16 -0500
From: "Gross, Danny"

Hi Usman,

I believe your issue is specifically in the contrib/hadoop-streaming.jar. I ran a test Python job with hadoop-streaming.jar on a bz2 file with no errors. However, the output was junk. Pig has no issue with bz2 files.

According to http://hadoop.apache.org/core/docs/r0.15.2/streaming.html, streaming.jar reads the file line-by-line. It seems that you would have to specify another plugin via the -inputformat option for your app in order for hadoop-streaming.jar to properly handle the compressed file.

I'm not much help beyond that. Perhaps you might try Pig (this is such a cool platform!)

Hope it helps.

Best regards,

Danny

-----Original Message-----
From: Usman Waheed [mailto:usmanw@opera.com]
Sent: Wednesday, June 24, 2009 10:32 AM
To: core-user@hadoop.apache.org
Subject: Re: Are .bz2 extensions supported in Hadoop 18.3

Hi Danny,

Hmmm, makes me wonder whether I might be doing something wrong here. I imported just one .bz2 file into HDFS and then launched a map/reduce job by executing the following command:

/home/hadoop/hadoop/bin/hadoop jar /home/hadoop/hadoop/contrib/streaming/hadoop-streaming.jar -input /user/hadoop/logs/2009/06/22/ -output /user/hadoop/out1 -mapper map.pl -file map.pl -reducer reduce.pl -file reduce.pl -jobconf mapred.reduce.tasks=1

The .bz2 file was in the /user/hadoop/logs/2009/06/22 directory but the final output part-00000 in /user/hadoop/out1 was meaningless. I was expecting key,value pairs but all I got was a count integer, for example: 31,006. No errors were generated at all.
When I ran the same command above with an uncompressed file, my output was fine, giving me the correct key,value pairs. No errors were generated.

Noted below are my map.pl and reduce.pl.

Thanks for your help,
Usman

map.pl:

#!/usr/bin/perl -w
#
#
# Read log lines from standard input, as Hadoop streaming supplies them.
while (<STDIN>) {
    chomp;
    # Skip any line that does not start with a digit.
    next if ( ! /^\d+/ );
    my @fields = split(/;/);
    # The 12th semicolon-separated field is the cookie.
    my $cookie = $fields[11];
    print "$cookie\t1\n";
}

reduce.pl:

#!/usr/bin/perl -w
#
#
# Sum the counts emitted by map.pl for each key.
while (<STDIN>) {
    chomp;
    ($key,$value) = split(/\t/);
    $count{$key} += $value;
}
# Print one "key<TAB>total" line per key.
foreach $k (keys %count) {
    $c = $count{$k};
    print "$k\t$c\n";
}

> Hi Usman,
>
> I'm running 0.18.3 from hadoop.apache.org, and have no issues with bz2
> files. My experiments with these files have been through Pig. Hope
> this is useful to you.
>
> Best regards,
>
> Danny Gross
>
> -----Original Message-----
> From: Usman Waheed [mailto:usmanw@opera.com]
> Sent: Wednesday, June 24, 2009 10:09 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Are .bz2 extensions supported in Hadoop 18.3
>
> The version (18.3) I am running in my cluster is the tarball I got from
> hadoop.apache.org.
> So you are suggesting to use the Cloudera 18.3 which supports bzip2,
> correct?
>
> Thanks,
> Usman
>
>> I believe the cloudera 18.3 supports bzip2
>>
>> On Wed, Jun 24, 2009 at 3:45 AM, Usman Waheed wrote:
>>
>>> Hi All,
>>>
>>> Can I map/reduce logs that have the .bz2 extension in Hadoop 18.3?
>>> I tried but interestingly the output was not what I expected versus
>>> what I got when my data was in uncompressed format.
>>>
>>> Thanks,
>>> Usman
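
Danny's remark that streaming reads its input line-by-line suggests one plausible reading of the "junk" output: if the job handed the raw compressed bytes to map.pl (an assumption, the thread does not confirm it), almost no "line" would start with a digit and the counts would be meaningless. The sketch below mirrors the map.pl and reduce.pl logic in Python and runs it once on plain text and once on the same data split on newlines while still bz2-compressed. The log layout produced by make_log is hypothetical; the thread only shows that fields are separated by ';' and that the 12th field is the cookie.

```python
import bz2


def make_log(n):
    """Build n hypothetical log lines: a leading number, filler, cookie in field 12."""
    rows = []
    for i in range(n):
        fields = [str(1000 + i)] + ["x"] * 10 + ["cookie%d" % (i % 3)]
        rows.append(";".join(fields))
    return ("\n".join(rows) + "\n").encode("ascii")


def map_lines(lines):
    """Mirror map.pl: keep lines starting with a digit, emit (field 12, 1)."""
    pairs = []
    for line in lines:
        if not line[:1].isdigit():  # bytes.isdigit() is False for b""
            continue
        fields = line.split(b";")
        key = fields[11] if len(fields) > 11 else b""
        pairs.append((key, 1))
    return pairs


def reduce_pairs(pairs):
    """Mirror reduce.pl: sum the values for each key."""
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return counts


plain = make_log(300)
packed = bz2.compress(plain)

# Uncompressed input: every line matches and the cookies are counted.
plain_counts = reduce_pairs(map_lines(plain.split(b"\n")))

# Compressed bytes naively split on newlines: no real log lines survive.
raw_counts = reduce_pairs(map_lines(packed.split(b"\n")))

# Decompressing first restores the expected result.
fixed_counts = reduce_pairs(map_lines(bz2.decompress(packed).split(b"\n")))

print("plain:", plain_counts)
print("raw compressed:", raw_counts)
print("decompressed:", fixed_counts)
```

The point of the comparison is only that the mapper needs decompressed text, whether that comes from an input format that understands bz2 or from decompressing the files before loading them into HDFS.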