hadoop-mapreduce-user mailing list archives

From David Rosenstrauch <dar...@darose.net>
Subject Re: How does a ReduceTask determine which MapTask output to read?
Date Wed, 29 Jun 2011 22:37:22 GMT
On 06/29/2011 05:28 PM, Virajith Jalaparti wrote:
> Hi,
>
> I was wondering what scheduling algorithm is used in Hadoop (version
> 0.20.2 in particular), for a ReduceTask to determine in what order it is
> supposed to read the map outputs from the various mappers that have been
> run? In particular, suppose we have 10maps called map1, map2,....,
> map10. and say 2 reducers r1 and r2. Which map's output does r1/r2 read
> from first?
>
> Also, suppose that the mapred.reduce.parallel.copies is set to 5. Then
> do both r1 and r2 read from 5 map outputs concurrently?
>
> Thanks,
> Virajith

You're missing two key steps here.  After the mappers run, each map's
output is partitioned by key (to spread the records across the
reducers) and sorted in key order within each partition.

So your question is really a moot one.  The records output by a given
map task are spread across multiple reducers rather than all being sent
to a single reducer.
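
To make the partitioning point concrete, here is a minimal, self-contained sketch in plain Java (no Hadoop dependency). The class and method names (PartitionSketch, partitionFor, partition) are hypothetical and only for illustration. The arithmetic in partitionFor mirrors what Hadoop's default HashPartitioner does; a job with a custom Partitioner would assign keys differently, and the sample keys are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Hadoop.
public class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner:
    // clear the sign bit of the hash, then take it modulo the reducer count.
    public static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Groups the keys emitted by ONE map task by destination reducer,
    // then sorts the keys inside each partition.
    public static Map<Integer, List<String>> partition(List<String> keys, int numReducers) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (String key : keys) {
            int reducer = partitionFor(key, numReducers);
            byReducer.computeIfAbsent(reducer, r -> new ArrayList<>()).add(key);
        }
        for (List<String> bucket : byReducer.values()) {
            Collections.sort(bucket);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        // Made-up keys standing in for the output of a single map task.
        List<String> mapOutputKeys = Arrays.asList("apple", "banana", "cherry", "date");
        // Prints which keys from this one map task go to each of 2 reducers.
        System.out.println(partition(mapOutputKeys, 2));
    }
}
```

Under these assumptions, the output of a single map task generally ends up split between r1 and r2, which is why every reducer has something to fetch from every map rather than each reducer "owning" particular maps.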

DR
