hadoop-common-dev mailing list archives

From elton sky <eltonsky9...@gmail.com>
Subject Re: Why mergeParts() is not parallel with collect() on map?
Date Tue, 03 May 2011 12:46:52 GMT
Dave,

you are right: collect() is called every time a [K,V] pair is inserted
into kvbuffer. What I meant was the point at which all [K,V] pairs have
been created and the last collect() has finished :).
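
A minimal sketch of that ordering, as a toy in-memory model in Python rather
than Hadoop's actual Java code. `ToyCollector`, `buffer_limit`, `flush` and
`word_map` are hypothetical names chosen for illustration. collect() may run
several times per map() call, every full buffer is sorted and spilled, and
the merge runs once, after the last collect():

```python
import heapq


class ToyCollector:
    """Toy stand-in for the map-side output buffer (hypothetical, not Hadoop code)."""

    def __init__(self, buffer_limit):
        self.buffer_limit = buffer_limit  # max records held before a spill
        self.buffer = []                  # plays the role of kvbuffer
        self.spills = []                  # each spill is one sorted run

    def collect(self, key, value):
        # Called once per output record, possibly many times per map() call.
        self.buffer.append((key, value))
        if len(self.buffer) >= self.buffer_limit:
            self._sort_and_spill()

    def _sort_and_spill(self):
        self.spills.append(sorted(self.buffer))
        self.buffer = []

    def flush(self):
        # Runs only after the last collect(): spill the remainder,
        # then merge all sorted runs in one pass.
        if self.buffer:
            self._sort_and_spill()
        return list(heapq.merge(*self.spills))


def word_map(line, collector):
    # A map() that emits several records per input record.
    for word in line.split():
        collector.collect(word, 1)


collector = ToyCollector(buffer_limit=3)
for line in ["b a c", "a d", "c b a"]:
    word_map(line, collector)
merged = collector.flush()
```

With three input lines the toy map emits eight records, so more output
records than input records already means more spills to merge at the end.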

But I think that if the map phase produces more output than it reads as
input, we need a different procedure.
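
One possible shape for such a procedure, sketched as a toy Python model and
not as a patch to Hadoop: a background thread merges sorted runs while the
map is still collecting and spilling. `background_merger`, `run_map`,
`buffer_limit` and `merge_factor` are hypothetical names, and plain integers
stand in for [K,V] records.

```python
import heapq
import queue
import threading


def background_merger(spill_queue, merge_factor, result):
    """Merge sorted runs as they arrive instead of waiting for the last collect()."""
    runs = []
    while True:
        run = spill_queue.get()
        if run is None:          # sentinel: the last collect() has finished
            break
        runs.append(run)
        if len(runs) >= merge_factor:
            # Intermediate merge, overlapping with further collect()/spill work.
            runs = [list(heapq.merge(*runs))]
    # Final merge of whatever runs are left.
    result.append(list(heapq.merge(*runs)))


def run_map(records, buffer_limit, merge_factor):
    spill_queue = queue.Queue()
    result = []
    merger = threading.Thread(
        target=background_merger, args=(spill_queue, merge_factor, result)
    )
    merger.start()

    buffer = []
    for record in records:       # each iteration stands in for one collect()
        buffer.append(record)
        if len(buffer) >= buffer_limit:
            spill_queue.put(sorted(buffer))   # sort & spill
            buffer = []
    if buffer:
        spill_queue.put(sorted(buffer))
    spill_queue.put(None)        # signal that collecting is done
    merger.join()
    return result[0]


output = run_map([9, 3, 7, 1, 8, 2, 6, 4, 5, 0], buffer_limit=2, merge_factor=3)
```

The trade-off is that intermediate merges rewrite records more than once, so
this only pays off when there are many spills.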

On Tue, May 3, 2011 at 10:29 PM, Dave Shine <
Dave.Shine@channelintelligence.com> wrote:

> I'm a relative newbie to Hadoop, but your assumption below is not correct
> in my organization.  It is common for us to call output.collect() more than
> once in a map() function.
>
> Dave Shine
>
>
> -----Original Message-----
> From: elton sky [mailto:eltonsky9404@gmail.com]
> Sent: Tuesday, May 03, 2011 4:49 AM
> To: common-dev@hadoop.apache.org
> Subject: Re: Why mergeParts() is not parallel with collect() on map?
>
> Please correct me if I am wrong. One of the important assumptions of Hadoop
> MapReduce is that a map's output should be smaller than its input, so the
> workload on the reduce side should be smaller than on the map side. That's
> why we put sort, spill and merge all on the map side. Reduce just merges
> the sorted output.
>
>
> > However, typically, the map's merge is much less intensive than the
> > reduce's merge. As a result, this might just bloat the code for little
> > gain, except in the most extreme cases.
>
> In some cases, if the output of the map is bigger than its input, there
> might be many spill files to be merged.
>
>
> On Tue, May 3, 2011 at 5:52 PM, Arun C Murthy <acm@yahoo-inc.com> wrote:
>
> > Elton,
> >
> >
> > On May 2, 2011, at 11:30 PM, elton sky wrote:
> >
> >> In the shuffle phase, reduce copies output from the maps. In parallel,
> >> an InMemoryMerger and an OnDiskMerger merge the copied files if there
> >> are too many. But on the map side, mergeParts() happens only after
> >> collect() has finished. Why don't we run spill merging in parallel with
> >> collect()/sort&spill on the map?
> >>
> >
> > Certainly feasible, please feel free to open a jira for the enhancement.
> >
> > However, typically, the map's merge is much less intensive than the
> > reduce's merge. As a result, this might just bloat the code for little
> > gain, except in the most extreme cases.
> >
> > Arun
> >
> >
> >
