crunch-user mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Cogroup sort order
Date Wed, 03 Aug 2016 14:51:35 GMT
Hi David,

I take it you're referring to the ordering of the two Collections
returned within the value Pair of a cogroup result?

As you probably know, there is no guaranteed ordering on these
collections. That said, given the same input and cluster layout, it's
quite possible that you would get the same iteration order on the
results each time.

However, there are probably quite a few underlying factors that could
change the iteration order of these Collections. For example, a
different number of partitions used by the reducers, or different
settings that influence when spills happen during the shuffle phase
(assuming we're talking about MR-based Crunch here), could each change
the order. Note that these things affect the overall ordering of
output in MapReduce itself and are not specific to Crunch.
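
If the job output needs to be stable across cluster types, one option
is to stop relying on iteration order and sort each collection
explicitly before using it. Below is a minimal sketch in plain Java
(no Crunch classes involved). The StableOrder class and its sorted()
method are hypothetical names made up for illustration, and the sketch
assumes the values have a natural ordering; the same idea could be
applied to each side of the cogroup value Pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of Crunch.
public class StableOrder {

    // Copies the values into a new list and sorts the copy by natural
    // ordering, so later logic does not depend on the order in which
    // the shuffle happened to deliver them.
    public static <T extends Comparable<? super T>> List<T> sorted(Collection<T> values) {
        List<T> copy = new ArrayList<>(values);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // The same logical values arriving in two different orders,
        // as might happen on two different cluster layouts.
        List<String> runA = Arrays.asList("b", "a", "c");
        List<String> runB = Arrays.asList("c", "b", "a");

        // After sorting, both runs yield the same sequence.
        System.out.println(sorted(runA).equals(sorted(runB))); // prints "true"
    }
}
```

Sorting a copy costs memory and time proportional to the size of each
group, so this is only reasonable when the per-key collections are
small enough to hold in memory.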

- Gabriel


On Wed, Aug 3, 2016 at 4:40 PM, David Ortiz <dpo5003@gmail.com> wrote:
> Hey everyone,
>
>       Just curious based on something I'm seeing as we move a job around
> between different ec2 cluster types.  Does the underlying architecture of
> the system have an effect on the sort order in a cogroup?  It's looking like
> moving from the cc2 architecture we were using to an m4 based system, that
> our job output changes.  The changes I am seeing line up with the order in
> which the iterator returns records being different, so was curious.
>
> Thanks,
>      Dave
