hadoop-mapreduce-user mailing list archives

From "Williams, Ken" <Ken.Willi...@windlogics.com>
Subject Streaming jobs getting poor locality
Date Wed, 22 Jan 2014 19:10:23 GMT
Hi,

I posted a question to Stack Overflow yesterday about an issue I'm seeing, but judging by
the low interest (only 7 views in 24 hours, and 3 of them are probably me! :-) ), it seems like
I should switch venues.  I'm pasting the same question here in the hope of finding someone
who is interested.

Original SO post is at http://stackoverflow.com/questions/21266248/hadoop-jobs-getting-poor-locality

*****************
I have some fairly simple Hadoop streaming jobs that look like this:

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
  -files hdfs:///apps/local/count.pl \
  -input /foo/data/bz2 \
  -output /user/me/myoutput \
  -mapper "cut -f4,8 -d," \
  -reducer count.pl \
  -combiner count.pl

The count.pl script is just a simple script that accumulates counts in a hash and prints them
out at the end - the details are probably not relevant but I can post it if necessary.

The input is a directory containing 5 files encoded with bz2 compression, roughly the same
size as each other, for a total of about 5GB (compressed).

When I look at the running job, it has 45 mappers, but they're all running on one node. The
particular node changes from run to run, but it is always a single node. As a result I'm getting
poor data locality, since most of the data has to be transferred over the network to that node,
and probably poor CPU utilization across the cluster too.
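
As a sanity check on the mapper count: 45 mappers over 5 files works out to 9 input splits per file. The sketch below estimates splits per file as the file size divided by the split size, rounded up. The 128 MB split size and the 1100 MB file size are assumptions for illustration (neither is stated in the post), and the estimate ignores details of how the input format rounds the last split.

```shell
#!/bin/sh
# Rough number of input splits for one file: ceil(file_mb / split_mb).
# Both arguments are whole megabytes.
estimate_splits() {
  file_mb=$1
  split_mb=$2
  echo $(( (file_mb + split_mb - 1) / split_mb ))
}

# Hypothetical sizes: five files of 1100 MB each, 128 MB splits (assumed).
total=0
for size_mb in 1100 1100 1100 1100 1100; do
  total=$(( total + $(estimate_splits "$size_mb" 128) ))
done
echo "estimated map tasks: $total"
```

If the estimate roughly matches the observed mapper count, the number of map tasks is probably as expected and only their placement is in question.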

The entire cluster has 9 nodes, all the same basic configuration. The blocks of the data for
all 5 files are spread out among the 9 nodes, as reported by the HDFS Name Node web UI.
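
The block distribution can also be checked from the command line with `hdfs fsck` and its `-files -blocks -locations` options. The following is a minimal sketch that tallies block replicas per datanode from that output; the helper name `count_replicas_per_node` is made up for illustration, and it assumes datanodes appear in the output as `IP:port` pairs, which may differ between Hadoop versions.

```shell
#!/bin/sh
# Tally block replicas per datanode from `hdfs fsck ... -locations` output on stdin.
# Assumes each replica location is printed as an IPv4 address followed by :port.
count_replicas_per_node() {
  grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+' \
    | sort \
    | uniq -c \
    | awk '{ print $2 "\t" $1 }'   # print "node<TAB>replica count"
}

# On the cluster (the path is the job's -input directory):
#   hdfs fsck /foo/data/bz2 -files -blocks -locations | count_replicas_per_node
```

A roughly even count across all 9 nodes would confirm that the data itself is well spread and that the issue lies in where the map tasks are scheduled.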

I'm happy to share any requested info from my configuration, but this is a corporate cluster
and I don't want to upload any full config files.

It looks like this previous thread, "why map task always running on a single node"
(http://stackoverflow.com/questions/12135949/why-map-task-always-running-on-a-single-node),
is relevant but not conclusive.

*****************

Thanks.

--
Ken Williams, Senior Research Scientist
WindLogics
http://windlogics.com

