Date: Thu, 23 Jan 2014 15:12:54 +0530
Subject: Re: Streaming jobs getting poor locality
From: sudhakara st <sudhakara.st@gmail.com>
To: user@hadoop.apache.org

I think that in order to configure a Hadoop job to read compressed input,
you have to specify the compression codec in code or on the command line,
like:

  -D io.compression.codecs=org.apache.hadoop.io.compress.BZip2Codec

On Thu, Jan 23, 2014 at 12:40 AM, Williams, Ken wrote:

> Hi,
>
> I posted a question to Stack Overflow yesterday about an issue I'm seeing,
> but judging by the low interest (only 7 views in 24 hours, and 3 of them
> are probably me! :-) it seems like I should switch venues. I'm pasting the
> same question here in hopes of finding someone with interest.
>
> The original SO post is at
> http://stackoverflow.com/questions/21266248/hadoop-jobs-getting-poor-locality.
>
> *****************
>
> I have some fairly simple Hadoop streaming jobs that look like this:
>
> yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
>   -files hdfs:///apps/local/count.pl \
>   -input /foo/data/bz2 \
>   -output /user/me/myoutput \
>   -mapper "cut -f4,8 -d," \
>   -reducer count.pl \
>   -combiner count.pl
>
> The count.pl script is just a simple script that accumulates counts in a
> hash and prints them out at the end. The details are probably not relevant,
> but I can post it if necessary.
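
[Aside: since the real count.pl is not posted, here is a hypothetical
stand-in for what such a counting reducer might do — accumulate a count per
input line and print key/count pairs at the end. This is only a sketch, not
the script from the quoted job:]

```shell
# Hypothetical stand-in for a count.pl-style streaming reducer (the real
# script is not shown): accumulate a count per input line, then print
# key<TAB>count at the end. awk's for-in order is unspecified, so sort.
printf 'a,b\na,b\nc,d\n' |
  awk '{ counts[$0]++ } END { for (k in counts) print k "\t" counts[k] }' |
  sort
# prints:
# a,b	2
# c,d	1
```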
>
> The input is a directory containing 5 files encoded with bz2 compression,
> each roughly the same size, for a total of about 5 GB (compressed).
>
> When I look at the running job, it has 45 mappers, but they're all running
> on one node. The particular node changes from run to run, but it is always
> only one node. Therefore I'm achieving poor data locality, as data is
> transferred over the network to this node, and probably poor CPU usage too.
>
> The entire cluster has 9 nodes, all with the same basic configuration. The
> blocks of the data for all 5 files are spread out among the 9 nodes, as
> reported by the HDFS NameNode web UI.
>
> I'm happy to share any requested info from my configuration, but this is a
> corporate cluster and I don't want to upload any full config files.
>
> It looks like this previous thread [why map task always running on a
> single node -
> http://stackoverflow.com/questions/12135949/why-map-task-always-running-on-a-single-node]
> is relevant but not conclusive.
>
> *****************
>
> Thanks.
>
> --
> Ken Williams, Senior Research Scientist
> WindLogics
> http://windlogics.com
>
> ------------------------------
>
> CONFIDENTIALITY NOTICE: This e-mail message is for the sole use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution of
> any kind is strictly prohibited. If you are not the intended recipient,
> please contact the sender via reply e-mail and destroy all copies of the
> original message. Thank you.

--
Regards,
...Sudhakara.st
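
P.S. If the codec is passed on the command line, note that generic options
such as -D and -files must come before the streaming-specific options. A
sketch against the quoted job (same jar and paths as above; not verified on
a cluster):

```shell
# Sketch only: the job from the quoted message, with the codec specified
# via the -D generic option. Generic options must precede streaming options.
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
  -D io.compression.codecs=org.apache.hadoop.io.compress.BZip2Codec \
  -files hdfs:///apps/local/count.pl \
  -input /foo/data/bz2 \
  -output /user/me/myoutput \
  -mapper "cut -f4,8 -d," \
  -reducer count.pl \
  -combiner count.pl
```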