hadoop-common-dev mailing list archives

From elton sky <eltonsky9...@gmail.com>
Subject Re: Why mergeParts() is not parallel with collect() on map?
Date Tue, 03 May 2011 08:48:44 GMT
Please correct me if I am wrong. One of the important assumptions of Hadoop
MapReduce is that a map's output should be smaller than its input, so the
workload in the reduce phase should be smaller than in the map phase. That's
why sort, spill, and merge are all done on the map side, and reduce just
merges the sorted output.


> However, typically, the map's merge is much less intensive than the
> reduce's merge. As a result, this might just bloat the code for little gain,
> except in the most extreme cases.

In some cases, though, if the map's output is bigger than its input, there
may be many spill files to merge.
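
To make the idea concrete, here is a minimal sketch of merging spills while collection is still going on. It is written in Python for brevity and is not Hadoop's MapTask code: the function name collect_with_background_merge and the parameters buffer_size and merge_factor are hypothetical, and sorted in-memory lists stand in for spill files on disk.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def collect_with_background_merge(records, buffer_size=4, merge_factor=3):
    """Sort-and-spill in the caller while a worker thread merges finished spills.

    Hypothetical illustration only: lists stand in for on-disk spill files.
    """
    spills = []   # sorted runs that have not been handed to the merger yet
    pending = []  # futures for merges running in the background
    buffer = []   # in-memory collect buffer

    with ThreadPoolExecutor(max_workers=1) as pool:

        def spill():
            # Sort the buffer and record it as a new sorted run.
            spills.append(sorted(buffer))
            buffer.clear()
            # Once enough runs have piled up, merge them in the background
            # instead of waiting for collection to finish.
            if len(spills) >= merge_factor:
                batch = spills[:]
                spills.clear()
                pending.append(
                    pool.submit(lambda runs=batch: list(heapq.merge(*runs)))
                )

        for rec in records:
            buffer.append(rec)
            if len(buffer) >= buffer_size:
                spill()
        if buffer:
            spill()

        # Wait for background merges, then keep any leftover unmerged runs.
        runs = [f.result() for f in pending] + spills

    # Final merge over far fewer runs than the number of original spills.
    return list(heapq.merge(*runs))
```

With many spills, the final merge in this sketch only has to combine a few pre-merged runs. Whether that pays off in a real map task depends on how much disk and CPU contention the background merge adds.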


On Tue, May 3, 2011 at 5:52 PM, Arun C Murthy <acm@yahoo-inc.com> wrote:

> Elton,
>
>
> On May 2, 2011, at 11:30 PM, elton sky wrote:
>
>> In the shuffle phase, reduce copies output from map. In parallel,
>> InMemoryMerger and OnDiskMerger merge the copied files if there are too
>> many. But on map, mergeParts() happens only after collect() has finished.
>> Why don't we run spill merging in parallel with collect()/sort&spill on
>> map?
>>
>
> Certainly feasible, please feel free to open a jira for the enhancement.
>
> However, typically, the map's merge is much less intensive than the
> reduce's merge. As a result, this might just bloat the code for little gain,
> except in the most extreme cases.
>
> Arun
>
>
>
