hadoop-mapreduce-user mailing list archives

From: java8964 <java8...@hotmail.com>
Subject: RE: Streaming jobs getting poor locality
Date: Thu, 23 Jan 2014 15:04:57 GMT
I believe Hadoop can figure out the codec from the file name extension, and the Bzip2 codec
is supported in Hadoop as a Java implementation, which is also a SplittableCompressionCodec.
So about 45 mappers for 5GB of bzip2 files is very reasonable: 5GB / 128MB per block comes
to roughly 40 block-sized splits.
The question is why ONLY one node will run these 45 mappers. What is described in the
original question is not very clear.
I am not very familiar with streaming and YARN (it looks like you are using MRv2). So why
do you think all the mappers are running on one node? Did someone else run other jobs in
the cluster at the same time? What are the memory allocation and configuration on each node
of your cluster?
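
[A minimal sketch, not from the thread, of how that codec lookup and splittability check can
be done with the Hadoop Java API; the file path is hypothetical:]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class CodecCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    // The factory matches on the file name suffix, e.g. ".bz2" -> BZip2Codec.
    CompressionCodec codec = factory.getCodec(new Path("/foo/data/bz2/part-00000.bz2"));
    if (codec instanceof SplittableCompressionCodec) {
      // A splittable codec lets FileInputFormat create one split per HDFS block,
      // which is where the ~45 mappers come from.
      System.out.println(codec.getClass().getName() + " is splittable");
    } else {
      System.out.println(codec == null ? "no codec matched" : "codec is not splittable");
    }
  }
}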
Date: Thu, 23 Jan 2014 15:12:54 +0530
Subject: Re: Streaming jobs getting poor locality
From: sudhakara.st@gmail.com
To: user@hadoop.apache.org

I think in order to configure a Hadoop job to read compressed input, you have to specify the
compression codec in code or on the command line, like:
-D io.compression.codecs=org.apache.hadoop.io.compress.BZip2Codec 
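
[A sketch of the "in code" alternative, assuming the mapreduce Job API; note that stock Hadoop
registers BZip2Codec by default, which fits the point above about the codec being picked up
from the file extension:]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ConfigureCodec {
  public static void main(String[] args) throws Exception {
    // Programmatic equivalent of: -D io.compression.codecs=...
    Configuration conf = new Configuration();
    conf.set("io.compression.codecs",
             "org.apache.hadoop.io.compress.BZip2Codec");
    Job job = Job.getInstance(conf, "count");
    // ... set input/output paths, mapper, reducer as usual ...
  }
}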

On Thu, Jan 23, 2014 at 12:40 AM, Williams, Ken <Ken.Williams@windlogics.com> wrote:

I posted a question to Stack Overflow yesterday about an issue I’m seeing, but judging by
the low interest (only 7 views in 24 hours, and 3 of them are probably me! :-) it seems like
I should switch venue. I’m pasting the same question here in hopes of finding someone with
interest.
Original SO post is at 
http://stackoverflow.com/questions/21266248/hadoop-jobs-getting-poor-locality .
I have some fairly simple Hadoop streaming jobs that look like this:
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming- \
  -files hdfs:///apps/local/count.pl \
  -input /foo/data/bz2 \
  -output /user/me/myoutput \
  -mapper "cut -f4,8 -d," \
  -reducer count.pl \
  -combiner count.pl
The count.pl script just accumulates counts in a hash and prints them out at the end - the
details are probably not relevant, but I can post it if necessary.
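
[The thread doesn't include count.pl; a hypothetical stand-in, written in Java for
illustration, might look like this. Since the same script is used as combiner and reducer,
it has to accept both bare keys from the mapper and its own "key<TAB>count" output:]

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

public class Count {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    Map<String, Long> counts = new HashMap<>();
    String line;
    while ((line = in.readLine()) != null) {
      // Accept either a bare key (from the mapper) or "key<TAB>count"
      // (from a previous combiner pass).
      String[] parts = line.split("\t", 2);
      long n = parts.length == 2 ? Long.parseLong(parts[1].trim()) : 1L;
      counts.merge(parts[0], n, Long::sum);  // accumulate counts in a hash
    }
    // Streaming expects tab-separated key/value pairs on stdout.
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      System.out.println(e.getKey() + "\t" + e.getValue());
    }
  }
}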

The input is a directory containing 5 files encoded with bz2 compression, roughly the same
size as each other, for a total of about 5GB (compressed).
When I look at the running job, it has 45 mappers, but they're all running on one node. The
particular node changes from run to run, but it's always only one node. Therefore I'm getting
poor data locality, since data is transferred over the network to this node, and probably
poor CPU usage too.
The entire cluster has 9 nodes, all the same basic configuration. The blocks of the data for
all 5 files are spread out among the 9 nodes, as reported by the HDFS Name Node web UI.
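
[One way to double-check that placement from code, assuming the HDFS Java client; the file
name is hypothetical:]

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockPlacement {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path("/foo/data/bz2/part-00000.bz2"));
    // Print which datanodes hold each block of the file -- the same
    // information the Name Node web UI reports.
    for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
      System.out.println("offset " + loc.getOffset() + " -> "
          + Arrays.toString(loc.getHosts()));
    }
  }
}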
I'm happy to share any requested info from my configuration, but this is a corporate cluster
and I don't want to upload any full config files.
It looks like this previous thread [ why map task always running on a single node -
http://stackoverflow.com/questions/12135949/why-map-task-always-running-on-a-single-node ]
is relevant but not conclusive.
Ken Williams, Senior Research Scientist



