hadoop-mapreduce-dev mailing list archives

From Arun C Murthy <...@hortonworks.com>
Subject Re: On the topic of task scheduling
Date Sun, 02 Sep 2012 18:46:47 GMT
Vasco,

 Welcome to Hadoop!

 Your observations are all correct - in the simplest case you launch all reduces up front (we used
to do that initially) and get a good 'pipeline' between the maps, the shuffle (i.e. moving map-outputs
to reduces) and the reduces themselves.

 However, one thing to remember is that keeping reduces up and running without sufficient
maps having completed is a waste of cluster resources. As a result, we have a simple
heuristic in hadoop-1: do not launch reduces until a certain percentage of the job's maps
are complete - by default it's set to 5%. There is still a flaw with it, though, regardless
of what you set it to: if it's too high, you lose the 'pipeline'; if it's too low (e.g. 5%),
reduces still spin waiting for all maps to complete, wasting resources in the cluster.
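
 For reference, that threshold is a per-job knob. A minimal sketch of overriding it
(property name as in hadoop-2, mapreduce.job.reduce.slowstart.completedmaps; the hadoop-1
name is mapred.reduce.slowstart.completed.maps - please check against your release):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SlowstartDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hold back reduces until 50% of the job's maps have completed
        // (the default is 0.05, i.e. 5%).
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.50f);
        Job job = Job.getInstance(conf, "slowstart-demo");
        // ... set mapper/reducer/input/output as usual, then job.waitForCompletion(true)
      }
    }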

 Given that, for hadoop-2 we've implemented the heuristic you've described below, which is
better at balancing resource utilization v/s pipelining (i.e. job latency).
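
 To make that concrete, here's a deliberately simplified, hypothetical sketch of such a
ramp-up rule (illustrative names and numbers only - not the actual AM/RM code): the share
of containers handed to reduces grows with map progress once a slowstart threshold is
crossed, and is capped so that maps are never starved.

    public class ReduceRampUp {
      /** Illustrative only: how many containers may be handed to reduces right now. */
      static int reduceContainerLimit(int totalContainers,
                                      int completedMaps, int totalMaps,
                                      float slowstart, float maxReduceShare) {
        float mapProgress = totalMaps == 0 ? 1.0f : (float) completedMaps / totalMaps;
        if (mapProgress < slowstart) {
          return 0;  // too few maps finished: hold all reduces back
        }
        // Ramp linearly from 0 at the slowstart point to maxReduceShare at 100% maps done.
        float ramp = (mapProgress - slowstart) / (1.0f - slowstart);
        return (int) (totalContainers * maxReduceShare * ramp);
      }

      public static void main(String[] args) {
        // e.g. 100 containers, 60 of 100 maps done, 5% slowstart, 50% cap -> 28
        System.out.println(reduceContainerLimit(100, 60, 100, 0.05f, 0.5f));
      }
    }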

 However, as you've pointed out, there are several improvements which are feasible. But remember
that the complexity involved depends on a number of factors you've already mentioned:
 # Job size (a job with 100m/10r v/s 100000m/10000r)
 # Skew for reduces
 # Resource availability i.e. other active jobs/shuffles in the system, network bandwidth
etc.

 If you look at an ideal shuffle, it will look something like this (pardon my primitive scribble):
 http://people.apache.org/~acmurthy/ideal-shuffle.png

 From that graph:
 # X i.e. when to launch reduces depends on resource availability, job size & maps' completion
rate.
 # Slope of shuffles (red worm) depends on network b/w, skew etc.
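
 As a back-of-envelope illustration of the first point (X, when to launch reduces) - purely
hypothetical numbers, and a deliberately naive model that assumes a steady map completion
rate and constant aggregate shuffle bandwidth:

    class ShuffleTiming {
      /**
       * Rough estimate of the fraction of maps that should have completed before
       * launching reduces, so the shuffle finishes about when the last map does.
       */
      static double reduceLaunchPoint(double mapPhaseSecs,
                                      double mapOutputBytes,
                                      double aggShuffleBytesPerSec) {
        double shuffleSecs = mapOutputBytes / aggShuffleBytesPerSec;
        return Math.max(0.0, 1.0 - shuffleSecs / mapPhaseSecs);
      }

      public static void main(String[] args) {
        // a 1000s map phase producing 100 GB, shuffled at 1 GB/s in aggregate -> 0.9,
        // i.e. start reduces at roughly 90% of maps complete
        System.out.println(reduceLaunchPoint(1000, 100e9, 1e9));
      }
    }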

 None of your points are invalid - I'm just pointing out the possibilities and complexities.

 Your points about aggregation are also valid - see, for example, http://code.google.com/p/sailfish/.

 One of the advantages of hadoop-2 is that anyone can play with these heuristics and implement
their own - I'd love to help if you are interested in playing with them.

 Related jiras:
 https://issues.apache.org/jira/browse/MAPREDUCE-4584

hth,
Arun

On Sep 2, 2012, at 9:34 AM, Vasco Visser wrote:

> Hi,
> 
> I am new to the list, I am working with hadoop in the context of my
> MSc graduation project (has nothing to do with task scheduling per
> se). I came across task scheduling because I ran into the FIFO
> starvation bug (MAPREDUCE-4613). Now, I am running the 2.1.0 branch where
> the FIFO starvation issue is solved. The behavior of task scheduling I
> observe in this branch is as follows. It begins with all containers
> allocated to mappers. Pretty quickly reducers are starting to be
> scheduled. In a linear way more containers are given to reducers,
> until about 50% (does anybody know why 50%?) of available containers
> are reducers (this point is reached when ~ 50% of the mappers are
> finished). It stays ~50-50 until all mappers are scheduled. Only
> then the proportion of containers allocated to reducers is increased
> to > 50%.
> 
> I don't think this is, in general, the optimal scheduling behavior (in
> terms of total job completion time). The reason is that the last
> reducer can only be scheduled when a free container becomes available
> after all mappers are scheduled. Thus, in order to shorten
> total job completion time the last reducer must be scheduled as early
> as possible.
> 
> For the following gedankenexperiment, assume the # of reducers is set to 99% of
> capacity, as suggested somewhere in the hadoop docs, and that each
> reducer will process roughly the same amount of work. I am going to
> schedule as in 2.1.0, but instead of allocating reducers slowly up to
> 50 % of capacity, I am just going to take away containers. Thus, the
> amount of map work is the same as in 2.1.0, only no reduce work will
> be done. At the point where the proportion of reducers would be increased
> to more than 50% of the containers (i.e., near the end of the map
> phase), I schedule all reducers in the containers I took away, making
> sure that the last reducer is scheduled at the same moment as it would
> be in 2.1.0.  My claim is that the job completion time of this
> hypothetical scheduling is about the same as the scheduling in 2.1.0
> (as the last reducer is scheduled at the same time), even though I
> took away 50% of the available resources for a large part of the job!
> The conclusion is that it would be better to allocate all available
> containers to mappers, and that reducers are starting to be scheduled
> when the map phase is nearing its end, instead of right at the
> beginning of the job.
> 
> Scheduling reducers early seems to me the way to go only when: 1) the
> output from mappers is very skewed, i.e., some reducers are expected
> to need much more time than others, 2) the network connection between
> nodes is (expected to be) a big bottleneck, i.e., schedule reducers
> early to smear out data transfer over the lifetime of a job, or 3)
> there is no contention for resource containers.
> 
> with regard to point 1: skewness can be determined by looking at
> relative sizes of partitioned mapper output.
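> 
> To make point 1 concrete, a minimal sketch of what I have in mind (hypothetical helper;
> assumes the per-partition byte counts are available, e.g. from the map output index):
> 
>     /** Hypothetical: flag a job as skewed if the largest partition dwarfs the mean. */
>     static boolean isSkewed(long[] partitionBytes, double threshold) {
>       long total = 0, max = 0;
>       for (long b : partitionBytes) {
>         total += b;
>         max = Math.max(max, b);
>       }
>       double mean = (double) total / partitionBytes.length;
>       return max > threshold * mean;  // e.g. threshold = 2.0
>     }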
> 
> with regard to point 2: I think the network is only a bottleneck if it
> feeds tuples slower than the reducer can merge sort the tuples (am I
> right?). Also, it might be a nice optimization to transfer the
> intermediate data to the machine that is going/likely to run a
> specific reducer before the reducer is actually run there (e.g.,
> something like a per machine prefetch manager?). A per machine task
> scheduling queue would be needed for this, to determine where a
> reducer is going/likely to be scheduled.
> 
> Just my two cents. I'm interested in hearing opinions on this matter.
> 
> Regards, Vasco

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/


