From: java8964
To: "user@hadoop.apache.org"
Subject: RE: Streaming jobs getting poor locality
Date: Thu, 23 Jan 2014 10:04:57 -0500

I believe Hadoop can determine the codec from the file name extension, and the Bzip2 codec is supported in Hadoop as a Java implementation, which is also a SplittableCompressionCodec.

So 5 GB of bzip2 files generating about 45 mappers is very reasonable, assuming 128 MB blocks.

The question is why only one node runs these 45 mappers. What is described in the original question is not very clear.

I am not very familiar with streaming and YARN (it looks like you are using MRv2). So why do you think all the mappers are running on one node? Did someone else run other jobs in the cluster at the same time? What are the memory allocation and configuration on each node in your cluster?

Yong

Date: Thu, 23 Jan 2014 15:12:54 +0530
Subject: Re: Streaming jobs getting poor locality
From: sudhakara.st@gmail.com
To: user@hadoop.apache.org

I think that in order to configure a Hadoop job to read compressed input, you have to specify the compression codec in code or on the command line, like

-D io.compression.codecs=org.apache.hadoop.io.compress.BZip2Codec

On Thu, Jan 23, 2014 at 12:40 AM, Williams, Ken <Ken.Williams@windlogics.com> wrote:

Hi,

I posted a question to Stack Overflow yesterday about an issue I'm seeing, but judging by the low interest (only 7 views in 24 hours, and 3 of them are probably me! :-) it seems like I should switch venue. I'm pasting the same question here in hopes of finding someone with interest.

Original SO post is at http://stackoverflow.com/questions/21266248/hadoop-jobs-getting-poor-locality .

*****************
I have some fairly simple Hadoop streaming jobs that look like this:

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
  -files hdfs:///apps/local/count.pl \
  -input /foo/data/bz2 \
  -output /user/me/myoutput \
  -mapper "cut -f4,8 -d," \
  -reducer count.pl \
  -combiner count.pl

The count.pl script is just a simple script that accumulates counts in a hash and prints them out at the end - the details are probably not relevant but I can post it if necessary.

The input is a directory containing 5 files encoded with bz2 compression, roughly the same size as each other, for a total of about 5GB (compressed).

When I look at the running job, it has 45 mappers, but they're all running on one node. The particular node changes from run to run, but always only one node. Therefore I'm achieving poor data locality as data is transferred over the network to this node, and probably achieving poor CPU usage too.

The entire cluster has 9 nodes, all the same basic configuration. The blocks of the data for all 5 files are spread out among the 9 nodes, as reported by the HDFS Name Node web UI.

I'm happy to share any requested info from my configuration, but this is a corporate cluster and I don't want to upload any full config files.

It looks like this previous thread [ why map task always running on a single node - http://stackoverflow.com/questions/12135949/why-map-task-always-running-on-a-single-node ] is relevant but not conclusive.

*****************

Thanks.

--
Ken Williams, Senior Research Scientist
WindLogics
http://windlogics.com

--
Regards,
...Sudhakara.st
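As a rough check on the 45-mapper figure discussed in this thread: assuming each bzip2 file is cut into block-sized chunks of 128 MB, the mapper count is approximately the sum over files of ceil(file size / block size). The sketch below is only an illustration of that arithmetic. The function name and the example file sizes are hypothetical, and Hadoop's real split calculation has additional details that are ignored here.

```python
import math

MIB = 1024 * 1024


def estimate_splits(file_sizes_bytes, split_size_bytes=128 * MIB):
    """Approximate the number of input splits for splittable files.

    Simplifying assumption: each file yields one split per block-sized
    chunk, with a final partial chunk counted as its own split.
    """
    return sum(
        math.ceil(size / split_size_bytes)
        for size in file_sizes_bytes
        if size > 0
    )


# Hypothetical sizes: five files of exactly 1 GiB each (5 GiB total).
exact_five_gib = [1024 * MIB] * 5

# Hypothetical sizes: five files of about 1.1 GiB each (about 5.5 GiB total).
slightly_larger = [1126 * MIB] * 5

print(estimate_splits(exact_five_gib))   # 8 splits per file
print(estimate_splits(slightly_larger))  # 9 splits per file
```

Under these assumptions, five files a little over 1 GiB each give nine splits per file, which is consistent with the 45 mappers reported for "about 5GB" of input.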