hadoop-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From <huan....@accenture.com>
Subject RE: what is the major difference between Hadoop and cloudMapReduce?
Date Fri, 04 Dec 2009 01:19:54 GMT
Todd,

Yes, the difference is that, if special ordering is required, we hold a few keys' worth of
values in memory v.s. Hadoop holding one key's worth of values. 

SQS has no limit on the number of queues you can create and the number of messages you can
put in a queue. In practice, we have run an experiment in which we created 4,000 queues with
100 threads, and it finished successfully in a few seconds. Have not tried to stress test
to millions of queues though. But even if the test fails with millions of queues, it will
be a bug with SQS to be fixed since Amazon advertises that there is no limit. Thanks.

-Huan

> -----Original Message-----
> From: Todd Lipcon [mailto:todd@cloudera.com]
> Sent: Thursday, December 03, 2009 9:45 AM
> To: general@hadoop.apache.org
> Subject: Re: what is the major difference between Hadoop and
> cloudMapReduce?
> 
> On Thu, Dec 3, 2009 at 9:39 AM, <huan.liu@accenture.com> wrote:
> 
> > > -----Original Message-----
> > > From: Todd Lipcon [mailto:todd@cloudera.com]
> > > Sent: Monday, November 30, 2009 8:15 AM
> > > To: general@hadoop.apache.org
> > > Subject: Re: what is the major difference between Hadoop and
> > > cloudMapReduce?
> > >
> > > On Mon, Nov 30, 2009 at 1:48 AM, <huan.liu@accenture.com> wrote:
> > >
> > > > Todd,
> > > >
> > > > We do not keep all values for a key in memory. Instead, we only
> keep
> > > the
> > > > partial reduce result in memory, but throw away the value as soon
> as
> > > it is
> > > > used. The point you raised is still very valid if the reduce
> state
> > > > maintained per key is large, which I hope is a very rare case. If
> you
> > > have
> > > > some concrete workload examples, it will help us prioritize the
> > > development
> > > > effort. I can definitely see the benefits of introducing a paging
> > > mechanism
> > > > to spill partial reduce results to the output SQS queue in the
> future.
> > > > Thanks.
> > > >
> > >
> > > Hi Huan,
> > >
> > > I guess I misremembered or misread the paper.
> > >
> > > Given this technique, doesn't it mean that reducers can only work
> when
> > > commutative and associative?
> > >
> > > -Todd
> >
> > Todd,
> >
> > I do not see how it is different from Hadoop's iterator interface,
> unless
> > the reduce function relies on the fact that the values are sorted in
> a
> > particular order when fed by the iterator one at a time.
> >
> > If there is no assumption on the value ordering, or the ordering
> expected
> > is different from what the iterator presents, the reduce function has
> to
> > read in all values from the iterator first (page to disk if
> necessary),
> > rearrange them as necessary, then process based on that new ordering.
> This
> > will be the same as what we will do in our iterator interface. In the
> next()
> > function, our reduce function can read in all values from the
> iterator (page
> > to disk if necessary), then in the finish() function, our reduce
> function
> > rearranges the ordering and process based on the new ordering.
> >
> >
> If you want sorted values, you can get that in Hadoop, though it's not
> on by
> default.
> 
> Also, the reducer only needs to keep all the values for a single key in
> RAM
> in this case. In your case, since the keys come in any order, the
> reducer
> would have to keep all the values for every key in that partition in
> RAM. I
> guess you're suggesting that you could have enough partitions that
> "every
> key in that partition" is only one or two, but for large datasets this
> just
> doesn't soudn feasible (can you get millions of SQS queues?)
> 
> Sure, you can implement your own paging to disk, and then an external
> sort,
> and then read them back in the more convenient sorted order. But then
> you're
> just implementing what Hadoop already does for you :)
> 
> -Todd
> 
> 
> >
> > This message is for the designated recipient only and may contain
> > privileged, proprietary, or otherwise private information.  If you
> have
> > received it in error, please notify the sender immediately and delete
> the
> > original.  Any other use of the email by you is prohibited.
> >


This message is for the designated recipient only and may contain privileged, proprietary,
or otherwise private information.  If you have received it in error, please notify the sender
immediately and delete the original.  Any other use of the email by you is prohibited.

Mime
View raw message