crunch-user mailing list archives

From David Ortiz
Subject RE: Cogroup sort order
Date Wed, 03 Aug 2016 14:53:37 GMT

Thanks.  That's exactly what I was talking about.  I get the same results when I run
multiple times on the same cluster with the same input (to be expected, since it's based on the
MR framework's sort), but when we went to an entirely new cluster architecture (different
hardware/layout) there were some changes.  I suspect this is causing the slight mismatches,
but wanted to make sure that was a rational thought.



-----Original Message-----
From: Gabriel Reid
Sent: Wednesday, August 03, 2016 10:52 AM
Subject: Re: Cogroup sort order

Hi David,

I take it you're referring to the ordering of the two Collections returned within the value
Pair of a cogroup result?

As you probably know, there isn't any kind of guaranteed ordering on these collections, although
I would expect that given the same input and cluster layout, it's perfectly possible that
you would get the same iteration order on the results each time.

However, there are also probably quite a few underlying factors which could change the iteration
order on these Collections; for example, just having a different number of partitions used
by the reducers, or different settings which would influence when spills are done during the
shuffle phase (assuming we're talking about MR-based Crunch here) could influence the iteration
order of the collections. Note that these are things that impact overall ordering of output
in MapReduce itself, and nothing specific to Crunch.
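
One way to make job output independent of this iteration order is to impose an explicit order on each side of the cogroup value before using it. The following is only a minimal sketch in plain Java, not Crunch API code: the class CogroupOrdering and the method sortedCopy are hypothetical names, and it assumes the values are Comparable and that each group is small enough to buffer in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of the Crunch API.
class CogroupOrdering {

    // Copies one side of a cogroup value into a list and sorts it, so that
    // downstream logic no longer depends on the order the shuffle produced.
    // Assumes the whole group fits in memory.
    static <T extends Comparable<? super T>> List<T> sortedCopy(Iterable<T> values) {
        List<T> copy = new ArrayList<>();
        for (T v : values) {
            copy.add(v);
        }
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // The same records, as two different clusters might deliver them.
        Collection<String> onClusterA = Arrays.asList("b", "a", "c");
        Collection<String> onClusterB = Arrays.asList("c", "b", "a");

        // After sorting, both runs see the records in the same order.
        System.out.println(sortedCopy(onClusterA).equals(sortedCopy(onClusterB)));
    }
}
```

In a real job, something like this would be applied to each of the two collections inside the function that processes the cogroup result. If groups can be very large, a secondary sort during the shuffle may be a better fit than buffering.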

- Gabriel

On Wed, Aug 3, 2016 at 4:40 PM, David Ortiz wrote:
> Hey everyone,
>       Just curious based on something I'm seeing as we move a job
> around between different ec2 cluster types.  Does the underlying
> architecture of the system have an effect on the sort order in a
> cogroup?  It's looking like moving from the cc2 architecture we were
> using to an m4 based system, that our job output changes.  The changes
> I am seeing line up with the order in which the iterator returns
> records being different, so was curious.
> Thanks,
>      Dave