hadoop-mapreduce-user mailing list archives

From Arun C Murthy <...@yahoo-inc.com>
Subject Re: When a Reduce Task starts?
Date Tue, 04 Jan 2011 08:07:48 GMT

On Dec 23, 2010, at 9:20 PM, pig wrote:
> For some special reduce jobs that do not rely on the order of
> (key,value) pairs, the sort phase is of no use.
> In this situation, theoretically speaking, reduce could be started
> before all of the map tasks have finished.
> But why doesn't Hadoop support this feature? For example, it could be
> specified as an argument when submitting a job.
>

Several reasons...

A major problem is errors: a map may fail after its output has been
'shuffled' by some reduces but not others (i.e. copied by only some
reduces). In that case, it's really hard to track and discard duplicate
key/value pairs.

The behaviour you seek is quite easy to model by running map-only
jobs, saving their output to HDFS, and processing it in the next job -
albeit with some performance penalty. But this keeps the MR
framework very simple and stable.
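To make the map-only suggestion concrete, here is a minimal sketch of such a job using the standard `org.apache.hadoop.mapreduce` API. Setting the number of reduce tasks to zero makes the framework skip the shuffle and sort entirely and write map output straight to HDFS; the class names and the pass-through mapper are illustrative, not from the original message.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative map-only job: with zero reduce tasks there is no
// shuffle and no sort, so map output goes directly to HDFS.
public class MapOnlyJob {

    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            // Emit records as-is; since no reduce follows, ordering
            // of (key, value) pairs is irrelevant.
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(PassThroughMapper.class);
        // Zero reduces: the framework skips shuffle/sort and each
        // mapper writes its output file directly to the output path.
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A second job can then read that HDFS output as its input, which reproduces "reduce starts before all maps finish" semantics at the cost of an extra materialization to HDFS, as described above.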

Arun
