hadoop-general mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: what is the major difference between Hadoop and cloudMapReduce?
Date Thu, 03 Dec 2009 17:45:20 GMT
On Thu, Dec 3, 2009 at 9:39 AM, <huan.liu@accenture.com> wrote:

> > -----Original Message-----
> > From: Todd Lipcon [mailto:todd@cloudera.com]
> > Sent: Monday, November 30, 2009 8:15 AM
> > To: general@hadoop.apache.org
> > Subject: Re: what is the major difference between Hadoop and
> > cloudMapReduce?
> >
> > On Mon, Nov 30, 2009 at 1:48 AM, <huan.liu@accenture.com> wrote:
> >
> > > Todd,
> > >
> > > We do not keep all values for a key in memory. Instead, we only keep
> > the
> > > partial reduce result in memory, but throw away the value as soon as
> > it is
> > > used. The point you raised is still very valid if the reduce state
> > > maintained per key is large, which I hope is a very rare case. If you
> > have
> > > some concrete workload examples, it will help us prioritize the
> > development
> > > effort. I can definitely see the benefits of introducing a paging
> > mechanism
> > > to spill partial reduce results to the output SQS queue in the future.
> > > Thanks.
> > >
> >
> > Hi Huan,
> >
> > I guess I misremembered or misread the paper.
> >
> > Given this technique, doesn't it mean that reducers can only work when
> > the reduce operation is commutative and associative?
> >
> > -Todd
> Todd,
> I do not see how it is different from Hadoop's iterator interface, unless
> the reduce function relies on the values being sorted in a particular
> order as the iterator feeds them in one at a time.
> If there is no assumption about value ordering, or the expected ordering
> differs from what the iterator presents, the reduce function has to read
> in all values from the iterator first (paging to disk if necessary),
> rearrange them as needed, and then process them in that new ordering. This
> is the same as what we would do in our iterator interface: in the next()
> function, our reduce function reads in all values from the iterator
> (paging to disk if necessary), and then in the finish() function it
> rearranges them and processes them in the new ordering.
If you want sorted values, you can get that in Hadoop, though it's not on by
default: you set up a secondary sort by folding the value into a composite
key and supplying a grouping comparator.
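To make the secondary-sort idea concrete, here's a toy sketch in plain Java (not the Hadoop API; the class and method names are mine): sort records by key and then by value, and group consecutive records that share a key. That is roughly the effect a composite key plus a grouping comparator give you inside a Hadoop reduce.

```java
import java.util.*;

// Toy illustration of secondary sort (not the Hadoop API): order records by
// (key, value), then group consecutive records that share the same key, so
// each group's values arrive already sorted.
public class SecondarySortSketch {
    static Map<String, List<Integer>> sortedGroups(List<Map.Entry<String, Integer>> records) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(records);
        sorted.sort((a, b) -> {
            int c = a.getKey().compareTo(b.getKey());                 // primary: natural key
            return c != 0 ? c : a.getValue().compareTo(b.getValue()); // secondary: value
        });
        // LinkedHashMap keeps the groups in key order, like a sorted shuffle would
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : sorted) {
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return groups;
    }
}
```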

Also, the reducer only needs to keep all the values for a single key in RAM
in this case. In your case, since the keys come in any order, the reducer
would have to keep all the values for every key in that partition in RAM. I
guess you're suggesting that you could have enough partitions that "every
key in that partition" is only one or two, but for large datasets this just
doesn't sound feasible (can you get millions of SQS queues?)
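The memory trade-off can be sketched in a few lines (the names here are mine, not from either system): a streaming reducer keeps only one partial result per key and discards each value the moment it is folded in. This only produces a correct answer because integer sum is commutative and associative, which is exactly the restriction in question.

```java
import java.util.*;

// Sketch of a streaming reducer: one accumulator per key, no buffering of
// values. Correct for any arrival order only because the combine step
// (integer sum) is commutative and associative.
public class StreamingReduce {
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> stream) {
        Map<String, Integer> partial = new HashMap<>();
        for (Map.Entry<String, Integer> e : stream) {
            // merge() folds the value into the partial result, then the value
            // itself is thrown away
            partial.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return partial;
    }
}
```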

Sure, you can implement your own paging to disk, and then an external sort,
and then read them back in the more convenient sorted order. But then you're
just implementing what Hadoop already does for you :)


