hadoop-common-dev mailing list archives

From Dave Shine <Dave.Sh...@channelintelligence.com>
Subject RE: Why mergeParts() is not parallel with collect() on map?
Date Tue, 03 May 2011 12:29:23 GMT
I'm a relative newbie to Hadoop, but your assumption below does not hold in my organization.
It is common for us to call output.collect() more than once in a map() function.

Dave Shine
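
As an illustration of the point above, here is a minimal sketch of a map() that calls collect() once per token, so one input record yields several output records. The Collector interface is a hypothetical stand-in for Hadoop's OutputCollector, used so that the example runs without Hadoop on the classpath; the class and method names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class MultiCollectSketch {
    /** Hypothetical stand-in for Hadoop's OutputCollector; not the real interface. */
    interface Collector {
        void collect(String key, int value);
    }

    /** Word-count style map(): one input line in, one collect() call per token. */
    static void map(String line, Collector output) {
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                output.collect(token, 1);
            }
        }
    }

    /** Runs map() over every line and returns the emitted records as "key<TAB>value". */
    static List<String> runMap(List<String> lines) {
        List<String> emitted = new ArrayList<>();
        for (String line : lines) {
            map(line, (k, v) -> emitted.add(k + "\t" + v));
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<String> input = List.of("to be or not to be");
        List<String> out = runMap(input);
        // The number of output records exceeds the number of input records.
        System.out.println(input.size() + " input record(s) -> " + out.size() + " output record(s)");
    }
}
```

With a job shaped like this, the record count leaving the map can exceed the record count entering it, which is the case where more spill files accumulate on the map side.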


-----Original Message-----
From: elton sky [mailto:eltonsky9404@gmail.com]
Sent: Tuesday, May 03, 2011 4:49 AM
To: common-dev@hadoop.apache.org
Subject: Re: Why mergeParts() is not parallel with collect() on map?

Please correct me if I am wrong. One of the important assumptions of Hadoop
MapReduce is that a map's output should be smaller than its input, so the
workload in the reduce phase should be smaller than in the map phase. That's
why we put sort, spill and merge all on the map side. Reduce just merges the
sorted output.


> However, typically, the map's merge is much less intensive than the
> reduce's merge. As a result, this might just bloat the code for little gain,
> except in the most extreme cases.

In some cases, if the output of a map is bigger than its input, there might be
many spill files to be merged.
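
To make the cost concrete, here is a rough sketch of what merging several sorted spills involves: a k-way merge driven by a min-heap. This is not Hadoop's mergeParts() code. Spills are modelled as in-memory sorted lists of keys rather than files, and all class and method names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class SpillMergeSketch {
    /** Cursor into one sorted spill: which run it belongs to and the position inside it. */
    private static final class Cursor {
        final int run;
        int pos;
        Cursor(int run) { this.run = run; }
    }

    /** Merges already-sorted runs into one sorted list using a min-heap (k-way merge). */
    static List<String> mergeSpills(List<List<String>> spills) {
        // The heap orders cursors by the key each one currently points at.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
            (a, b) -> spills.get(a.run).get(a.pos).compareTo(spills.get(b.run).get(b.pos)));
        for (int i = 0; i < spills.size(); i++) {
            if (!spills.get(i).isEmpty()) {
                heap.add(new Cursor(i));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();          // cursor holding the smallest current key
            List<String> run = spills.get(c.run);
            merged.add(run.get(c.pos));
            c.pos++;                         // advance only while the cursor is off the heap
            if (c.pos < run.size()) {
                heap.add(c);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> spills = List.of(
            List.of("a", "c", "e"),
            List.of("b", "d"),
            List.of("a", "f"));
        System.out.println(mergeSpills(spills));
    }
}
```

Each output key costs one heap operation over k cursors, so the work grows with both the total number of records and the number of spills. That is why a map whose output is larger than its input ends up with a heavier merge.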


On Tue, May 3, 2011 at 5:52 PM, Arun C Murthy <acm@yahoo-inc.com> wrote:

> Elton,
>
>
> On May 2, 2011, at 11:30 PM, elton sky wrote:
>
>> In the shuffle phase, reduce copies output from map. In parallel, the
>> InMemoryMerger and OnDiskMerger merge the copied files if there are too
>> many. But on map, mergeParts() happens only after collect() has finished.
>> Why don't we run spill merging in parallel with collect()/sort&spill on
>> map?
>>
>
> Certainly feasible, please feel free to open a jira for the enhancement.
>
> However, typically, the map's merge is much less intensive than the
> reduce's merge. As a result, this might just bloat the code for little gain,
> except in the most extreme cases.
>
> Arun
>
>
>

