hadoop-hdfs-user mailing list archives

From Luke Lu <...@apache.org>
Subject Re: What happens when you have fewer input files than mapper slots?
Date Thu, 21 Mar 2013 18:44:25 GMT
> Short version : let's say you have 20 nodes, and each node has 10 mapper
> slots.  You start a job with 20 very small input files.  How is the work
> distributed to the cluster?  Will it be even, with each node spawning one
> mapper task?  Is there any way of predicting or controlling how the work
> will be distributed?

You're right to expect that the tasks of the small job will likely be
evenly distributed among the 20 nodes, provided the 20 files are evenly
distributed among the nodes and there are free slots on every node.

> Long version : My cluster is currently used for two different jobs.  The
> cluster is currently optimized for Job A, so each node has a maximum of 18
> mapper slots.  However, I also need to run Job B.  Job B is VERY
> cpu-intensive, so we really only want one mapper to run on a node at any
> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
> gives you any way to set the maximum number of mappers per node on a
> per-job basis.  I'm at my wit's end here, and considering some rather
> egregious workarounds.  If you can think of anything that can help me, I'd
> very much appreciate it.

Are you seeing that Job B's tasks are not being evenly distributed across
the nodes? You can check the locations of the files with hadoop fsck. If
evenness is the goal, you can also write your own input format that returns
empty locations for each split and reads the small files directly in the
map task. If you're using Hadoop 1.0.x and the fair scheduler, you might
need to set mapred.fairscheduler.assignmultiple to false in mapred-site.xml
(JT restart required) to work around a bug in the fair scheduler
(MAPREDUCE-2905) that causes tasks to be assigned unevenly. The bug is
fixed in Hadoop 1.1+.
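The empty-locations idea could be sketched roughly like this (a hypothetical
subclass using the org.apache.hadoop.mapreduce API; the class name
LocationlessTextInputFormat is mine, not anything shipped with Hadoop):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch: an input format that hides block locations so the scheduler
// treats every split as non-local and is free to spread tasks evenly.
public class LocationlessTextInputFormat extends TextInputFormat {
    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (InputSplit split : super.getSplits(job)) {
            FileSplit fs = (FileSplit) split;
            // Re-create each split with an empty host list; the map task
            // still reads the file through HDFS, just not necessarily
            // from a local replica.
            splits.add(new FileSplit(fs.getPath(), fs.getStart(),
                                     fs.getLength(), new String[0]));
        }
        return splits;
    }
}
```

You give up data locality for those splits, which is usually fine here
since the input files are very small.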

