hadoop-user mailing list archives

From "Williams, Ken" <Ken.Willi...@windlogics.com>
Subject RE: Streaming jobs getting poor locality
Date Thu, 23 Jan 2014 16:06:03 GMT
java8964 wrote:

> I believe Hadoop can figure out the codec from the file name extension, and the Bzip2 codec
> is supported in Hadoop as a Java implementation, which is also a SplittableCompressionCodec.
> So 5 GB of bzip2 files generating about 45 mappers is very reasonable, assuming 128 MB blocks.

Correct - the number of splits seems reasonable to me too, and the codec is indeed figured
out automatically.  The job does produce the correct output.

> The question is why ONLY one node will run these 45 mappers.

> I am not very familiar with streaming and YARN (it looks like you are using MRv2).
> Why do you think all the mappers are running on one node? Did someone else run other jobs
> in the cluster at the same time? What are the memory allocation and configuration
> on each node in your cluster?

1) I am using Hadoop 2.2.0.

2) I know they're all running on one node because in the Ambari interface (we're using the
free Hortonworks distro) I can see that all the map tasks are assigned to the same IP address.
I can confirm it by SSH-ing to that node and watching the tasks run there with 'top' or 'ps'.
If I SSH to any other node, I see no jobs running.

3) There are no other jobs running at the same time on the cluster.

4) All data nodes are configured identically, and the config includes:
    resourcemanager_heapsize = 1024
    nodemanager_heapsize = 1024
    yarn_heapsize = 1024
    yarn.nodemanager.resource.memory-mb = 98403
    yarn.nodemanager.vmem-pmem-ratio = 2.1
    yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
    yarn.scheduler.minimum-allocation-mb = 512
    yarn.scheduler.maximum-allocation-mb = 10240
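Given the settings above, a single NodeManager advertises far more capacity than 45 minimum-size containers need, so nothing in the resource limits alone prevents the scheduler from packing every mapper onto one node (a rough estimate; the real per-container size depends on mapreduce.map.memory.mb, which is not shown here):

```shell
# How many minimum-size containers fit on one NodeManager, using the values above?
node_mb=98403        # yarn.nodemanager.resource.memory-mb
min_alloc_mb=512     # yarn.scheduler.minimum-allocation-mb
echo $((node_mb / min_alloc_mb))   # 192 -- room for all 45 mappers on a single node
```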


From: Williams, Ken
Sent: Wednesday, January 22, 2014 1:10 PM
To: 'user@hadoop.apache.org'
Subject: Streaming jobs getting poor locality


I posted a question to Stack Overflow yesterday about an issue I'm seeing, but judging by
the low interest (only 7 views in 24 hours, and 3 of them are probably me! :-) it seems like
I should switch venues.  I'm pasting the same question here in hopes of finding someone with
more relevant experience.

The original SO post is at http://stackoverflow.com/questions/21266248/hadoop-jobs-getting-poor-locality

I have some fairly simple Hadoop streaming jobs that look like this:

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming- \
  -files hdfs:///apps/local/count.pl \
  -input /foo/data/bz2 \
  -output /user/me/myoutput \
  -mapper "cut -f4,8 -d," \
  -reducer count.pl \
  -combiner count.pl

The count.pl script simply accumulates counts in a hash and prints them out at the end;
the details are probably not relevant, but I can post it if necessary.
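The actual count.pl is not shown in the thread, but purely as an illustration, a streaming count step of this kind behaves much like the classic sort-and-count pipeline (a hypothetical stand-in, not the author's script):

```shell
# Hypothetical stand-in for a count reducer: read lines on stdin,
# emit each distinct line with the number of times it occurred.
printf 'a,b\na,b\nc,d\n' | sort | uniq -c
```

In streaming, the reducer sees its keys already sorted, which is what makes a `uniq -c`-style accumulation viable.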

The input is a directory containing 5 files encoded with bz2 compression, roughly the same
size as each other, for a total of about 5GB (compressed).

When I look at the running job, it has 45 mappers, but they're all running on one node. The
particular node changes from run to run, but always only one node. Therefore I'm achieving
poor data locality as data is transferred over the network to this node, and probably achieving
poor CPU usage too.

The entire cluster has 9 nodes, all the same basic configuration. The blocks of the data for
all 5 files are spread out among the 9 nodes, as reported by the HDFS Name Node web UI.

I'm happy to share any requested info from my configuration, but this is a corporate cluster
and I don't want to upload any full config files.

It looks like this previous thread [ why map task always running on a single node - http://stackoverflow.com/questions/12135949/why-map-task-always-running-on-a-single-node
] is relevant but not conclusive.



