hadoop-common-dev mailing list archives

From 博亮 <fenglian...@gmail.com>
Subject Re: Why mergeParts() is not parallel with collect() on map?
Date Tue, 03 May 2011 13:31:17 GMT
Elton,

I think the work of each map task, including sort, partition and spill, can
mostly be processed in memory, so the benefit of running the merge in parallel
is not obvious. The reduce, on the other hand, has to receive map output from
several map tasks across the cluster, so overlapping the merge with the copy
is more worthwhile there.
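
To make the idea under discussion concrete, here is a minimal Python sketch of
merging spills in the background while collection continues. It is a toy model,
not Hadoop's MapTask code: the names map_side, merge_runs, SPILL_THRESHOLD and
MERGE_FACTOR are hypothetical, and the two constants only loosely stand in for
the roles of io.sort.mb and io.sort.factor.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

SPILL_THRESHOLD = 4  # hypothetical: records buffered before a spill
MERGE_FACTOR = 3     # hypothetical: spills handed to one background merge


def merge_runs(runs):
    """K-way merge of already-sorted runs into a single sorted run."""
    return list(heapq.merge(*runs))


def map_side(records):
    """Collect records, spill sorted runs, and merge spills in the background."""
    buffer, spills, pending = [], [], []
    with ThreadPoolExecutor(max_workers=1) as merger:
        for rec in records:
            buffer.append(rec)                  # stands in for collect()
            if len(buffer) >= SPILL_THRESHOLD:
                spills.append(sorted(buffer))   # stands in for sort & spill
                buffer = []
            if len(spills) >= MERGE_FACTOR:
                # Hand a batch of spills to the merger thread; the loop
                # keeps collecting while that merge runs.
                pending.append(merger.submit(merge_runs, spills))
                spills = []
        if buffer:
            spills.append(sorted(buffer))       # final partial spill
        # Final merge, the step mergeParts() performs today, but here over
        # fewer and larger runs because some merging already happened.
        runs = [f.result() for f in pending] + spills
    return merge_runs(runs)
```

Whether this overlap pays off depends on how many spills a map produces; with
only a handful of spills the final merge is cheap either way.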

Boliang


On Tue, May 3, 2011 at 8:46 PM, elton sky <eltonsky9404@gmail.com> wrote:

> Dave,
>
> you are right, collect() is called whenever a [K,V] pair is inserted
> into kvbuffer. Here, I mean the point when all [K,V] pairs have been created
> and the last collect() has finished :).
>
> But I think if the map phase creates more output than input, we need a
> different procedure.
>
> On Tue, May 3, 2011 at 10:29 PM, Dave Shine <
> Dave.Shine@channelintelligence.com> wrote:
>
> > I'm a relative newbie to Hadoop, but your assumption below is not correct
> > in my organization.  It is common for us to call output.collect() more
> > than once in a map() function.
> >
> > Dave Shine
> >
> >
> > -----Original Message-----
> > From: elton sky [mailto:eltonsky9404@gmail.com]
> > Sent: Tuesday, May 03, 2011 4:49 AM
> > To: common-dev@hadoop.apache.org
> > Subject: Re: Why mergeParts() is not parallel with collect() on map?
> >
> > Please correct me if I am wrong. One of the important assumptions of Hadoop
> > MapReduce is that a map's output should be smaller than its input, so the
> > workload on reduce should be smaller than on the map phase. That's why we
> > put sort, spill and merge all on the map side. Reduce just merges sorted
> > output.
> >
> >
> > > However, typically, the map's merge is much less intensive than the
> > > reduce's merge. As a result, this might just bloat the code for little
> > > gain, except in the most extreme cases.
> >
> > In some cases, if the output of map is bigger than input, there might be
> > many spill files to be merged.
> >
> >
> > On Tue, May 3, 2011 at 5:52 PM, Arun C Murthy <acm@yahoo-inc.com> wrote:
> >
> > > Elton,
> > >
> > >
> > > On May 2, 2011, at 11:30 PM, elton sky wrote:
> > >
> > >> In the shuffle phase, reduce copies output from map. In parallel, the
> > >> InMemoryMerger and OnDiskMerger merge the copied files if there are too
> > >> many. But on map, mergeParts() happens only after collect() has
> > >> finished. Why don't we run spill merging in parallel with
> > >> collect()/sort&spill on map?
> > >>
> > >
> > > Certainly feasible, please feel free to open a jira for the enhancement.
> > >
> > > However, typically, the map's merge is much less intensive than the
> > > reduce's merge. As a result, this might just bloat the code for little
> > > gain, except in the most extreme cases.
> > >
> > > Arun
> > >
> > >
> > >
> >
> >
>
