hadoop-common-dev mailing list archives

From: Amr Awadallah <...@cloudera.com>
Subject: Re: Need help understanding the source
Date: Tue, 07 Jul 2009 07:30:17 GMT
To add to Todd's and Ted's wise words: the Hadoop (and MapReduce)
architects didn't impose this limitation just for fun; it is core to
making Hadoop as reliable as it is. If the reducer started processing
mapper output immediately and a specific mapper then failed, the reducer
would have to know how to undo the specific pieces of work related to
the failed mapper, which is not trivial at all. That said, combiners do
achieve a bit of this for you: they start working immediately on the map
output, but on a per-mapper basis (not globally), so failure is easy to
handle in that case (you just redo that mapper and the combining for it).
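
For concreteness, here is a minimal sketch of such a per-mapper combiner
(a hypothetical WordCount-style sum; the class name and the driver line
are illustrative, not from this thread):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // Runs over each mapper's output before the shuffle, emitting
  // partial per-mapper sums that the real reducer then aggregates.
  public class SumCombiner
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
                          Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result); // partial, per-mapper sum
    }
  }

  // In the driver: job.setCombinerClass(SumCombiner.class);

Because a combiner only ever sees a single mapper's output, rerunning a
failed map task simply reruns its combine pass as well; nothing global
has to be undone.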

-- amr

Ted Dunning wrote:
> I would consider this to be a very delicate optimization with little utility
> in the real world.  It is very, very rare to reliably know how many records
> the reducer will see.  Getting this wrong would be a disaster; getting it
> right would be very difficult in almost all cases.
>
> Moreover, this assumption is baked all through the MapReduce design, so
> making a change that allows the reduce to go ahead early is likely to be
> really tricky (not that I know this for a fact).
>
> On Mon, Jul 6, 2009 at 11:14 AM, Naresh Rapolu
> <nareshreddy.rapolu@gmail.com> wrote:
>> My aim is to make the reduce move ahead with reduction as and when it gets
>> the data it requires, instead of waiting for all the maps to complete.  If
>> it knows how many records it needs and compares that with the number of
>> records it has received so far, it can move on once they become equal,
>> without waiting for all the maps to finish.
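
For illustration of the count-matching idea quoted above, here is a toy
sketch, entirely outside Hadoop (the class and its members are
hypothetical, not part of any Hadoop API). It also shows the fragility
Ted points out: if expectedRecords is wrong, the reduce either starts
too early or never starts at all.

  import java.util.concurrent.atomic.AtomicLong;

  // Toy early-start gate: the "reducer" may proceed as soon as the
  // number of records received equals the number it expects, instead
  // of waiting for an all-maps-finished signal.
  public class EarlyStartGate {
    private final long expectedRecords;
    private final AtomicLong received = new AtomicLong();

    public EarlyStartGate(long expectedRecords) {
      this.expectedRecords = expectedRecords;
    }

    // Called once per record delivered from a map task.
    public void recordArrived() {
      received.incrementAndGet();
    }

    // True once every expected record has arrived.
    public boolean mayStartReduce() {
      return received.get() >= expectedRecords;
    }
  }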
