hadoop-user mailing list archives

From java8964 <java8...@hotmail.com>
Subject RE: Streaming jobs getting poor locality
Date Thu, 23 Jan 2014 20:35:55 GMT
I cannot explain it. (Your configuration looks fine to me, and you mention that those mappers
run on only ONE node in any given run, but the node can differ from run to run.) But as I
said, I am not an expert in YARN, as it is also very new to me. Let's see if someone else on
the list can give you some hints.
In the meantime, maybe you can do some tests to help us narrow down the cause. Since you just
need 'cut' in your mapper:
1) Run an example job, like 'Pi' or 'Write output', from the hadoop-examples jar. Do the
mapper tasks run concurrently on multiple nodes in your cluster?
2) If you don't use a bzip2 file as input, do you have the same problem with other file
types, like plain text?
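Suggestion 2) can be rehearsed locally before touching the cluster. This is a minimal sketch, with illustrative file names (the real inputs live in HDFS), of producing a plain-text copy of a bzip2 input:

```shell
# Make a tiny bzip2-compressed sample, then decompress it back to plain text,
# the same way an uncompressed variant of one real input file could be prepared.
# (File names here are illustrative, not from the thread.)
printf 'a,b,c,x,e,f,g,y\n' > sample.csv
bzip2 -kf sample.csv                    # writes sample.csv.bz2, keeps sample.csv
bzip2 -dc sample.csv.bz2 > plain.csv    # decompress to a plain-text copy
cmp -s sample.csv plain.csv && echo "round-trip OK"
```

The plain-text copy would then be uploaded (e.g. with `hdfs dfs -put`) and passed as `-input` in place of the bz2 directory.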
Yong

From: Ken.Williams@windlogics.com
To: user@hadoop.apache.org
Subject: RE: Streaming jobs getting poor locality
Date: Thu, 23 Jan 2014 16:06:03 +0000

java8964 wrote:
 
> I believe Hadoop can figure out the codec from the file name extension, and the Bzip2 codec
> is supported in Hadoop as a Java implementation, which is also a SplittableCompressionCodec.
> So 5 GB of bzip2 files generating about 45 mappers is very reasonable, assuming 128 MB/block.
 
Correct - the number of splits seems reasonable to me too, and the codec is indeed figured
out automatically.  The job does produce the correct output.
 
> The question is why ONLY one node will run these 45 mappers.
 
Exactly.
 
> I am not very familiar with streaming and YARN (it looks like you are using MRv2). So
> why do you think all the mappers are running on one node? Did someone else run other jobs
> in the cluster at the same time? What are the memory allocation and configuration in your
> cluster on each node?
 
1) I am using Hadoop 2.2.0.
2) I know they’re all running on one node because in the Ambari interface (we’re using
the free Hortonworks distro) I can see that all the map tasks are assigned to the same IP
address. I can confirm it by SSH-ing to that node and seeing all the tasks running there,
using 'top' or 'ps' or whatever. If I SSH to any other node, I see no tasks running.
 
3) There are no other jobs running at the same time on the cluster.
 
4) All data nodes are configured identically, and the config includes:
    resourcemanager_heapsize = 1024
    nodemanager_heapsize = 1024
    yarn_heapsize = 1024
    yarn.nodemanager.resource.memory-mb = 98403
    yarn.nodemanager.vmem-pmem-ratio = 2.1
    yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
    yarn.scheduler.minimum-allocation-mb = 512
    yarn.scheduler.maximum-allocation-mb = 10240
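One CapacityScheduler knob is not shown in the list above, so this is only a hedged suggestion of something to check rather than a known cause: the node-locality delay, which controls how many scheduling opportunities the scheduler will pass up while waiting for a node-local container request before accepting an off-node one. It is commonly set to roughly the number of nodes in the cluster:
    yarn.scheduler.capacity.node-locality-delay = 40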
 
 
-Ken
 
From: Williams, Ken
Sent: Wednesday, January 22, 2014 1:10 PM
To: 'user@hadoop.apache.org'
Subject: Streaming jobs getting poor locality

Hi,
 
I posted a question to Stack Overflow yesterday about an issue I’m seeing, but judging by
the low interest (only 7 views in 24 hours, and 3 of them are probably me! :-) it seems like
I should switch venues. I’m pasting the same question here in hopes of finding someone with
interest.
 
Original SO post is at 
http://stackoverflow.com/questions/21266248/hadoop-jobs-getting-poor-locality .
 
*****************
I have some fairly simple Hadoop streaming jobs that look like this:
 
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
  -files hdfs:///apps/local/count.pl \
  -input /foo/data/bz2 \
  -output /user/me/myoutput \
  -mapper "cut -f4,8 -d," \
  -reducer count.pl \
  -combiner count.pl
 
The count.pl script is just a simple script that accumulates counts in a hash and prints them
out at the end - the details are probably not relevant but I can post it if necessary.
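As a sanity check, the whole pipeline can be dry-run locally on a few sample lines, outside Hadoop entirely. This is only a sketch: the awk one-liner stands in for count.pl, assuming it tallies occurrences of each distinct key as described above.

```shell
# Simulate the streaming pipeline by hand: mapper | shuffle (sort) | reducer.
# The awk stand-in for count.pl is an assumption based on its description:
# accumulate counts per distinct line in a hash, print them at the end.
printf 'a,b,c,x,e,f,g,y\na,b,c,x,e,f,g,y\na,b,c,z,e,f,g,w\n' > sample.csv

cut -f4,8 -d, sample.csv \
  | sort \
  | awk '{n[$0]++} END {for (k in n) print k "\t" n[k]}' \
  | sort
# prints:
# x,y	2
# z,w	1
```

The same shape holds on the cluster: the streaming mapper emits lines, the framework sorts them, and the reducer sees each key's lines grouped together.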
 
The input is a directory containing 5 files encoded with bz2 compression, roughly the same
size as each other, for a total of about 5GB (compressed).
 
When I look at the running job, it has 45 mappers, but they're all running on one node. The
particular node changes from run to run, but it's always just one node. Therefore I'm getting
poor data locality, as data is transferred over the network to this node, and probably poor
CPU utilization too.
 
The entire cluster has 9 nodes, all the same basic configuration. The blocks of the data for
all 5 files are spread out among the 9 nodes, as reported by the HDFS Name Node web UI.
 
I'm happy to share any requested info from my configuration, but this is a corporate cluster
and I don't want to upload any full config files.
 
It looks like this previous thread [ why map task always running on a single node -
http://stackoverflow.com/questions/12135949/why-map-task-always-running-on-a-single-node ]
is relevant but not conclusive.
 
*****************
 