crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-133) Add Aggregator support for combineValues ops on secondary keys via maps and collections
Date Sun, 23 Dec 2012 09:04:12 GMT


Gabriel Reid commented on CRUNCH-133:

Sorry for the super slow uptake on this Josh. I started looking at this last week, and it
took a while to grok it, and by then I was out of time to really review it -- and then it
just didn't happen. Thanks for the reminder anyhow.

I've added some comments (or actually repeated the same comment over and over) on RB, but
I also wanted to comment on the general approach here.

I can get that this is a use case that could be useful, but (in what seems to be a common
thread for me lately) I'm a bit hesitant to add the arity and especially the copy methods
to Aggregator, as these seem very specific to this use case, and it seems a shame to add two
methods to what is otherwise a pretty simple and clean interface for this one use case.

I was thinking (although I'm not totally sure) that the Aggregator interface could possibly
be split out into two interfaces, something like UnaryAggregator (which doesn't include an
arity method, and could probably just be called Aggregator) and an ArityNAggregator, which
would have an arity method (and would need a better name than ArityNAggregator). The CollectionAggregator
and MapAggregator would then only use UnaryAggregators. Anyhow, it's just an idea, and I might
be missing some important details.

I'm actually more worried about the copy method, because it feels like that can be a risky/tricky
one to do properly. As I mentioned on RB, I think that a call to copy on a decorating Aggregator
should probably call copy on the underlying Aggregators instead of using the same reference.
Even then, I worry about issues around transient state that comes from Configurations being
properly copied in implementations of copy, etc.

As an alternative, how about just using an AggregatorFactory (or UnaryAggregatorFactory) in
the MapAggregator and CollectionAggregator? Then there's no need to call copy everywhere.

> Add Aggregator support for combineValues ops on secondary keys via maps and collections
> ---------------------------------------------------------------------------------------
>                 Key: CRUNCH-133
>                 URL:
>             Project: Crunch
>          Issue Type: New Feature
>            Reporter: Josh Wills
>         Attachments: CRUNCH-133.patch
> Sawzall has a neat trick where you can do aggregations on secondary keys via maps, which
is useful in cases where you might want to aggregate some data at (for example) both a country
and at a city level within a single MapReduce job. We had a thread on crunch-user about this
> The pattern ends up looking something like this:
> // Define a table that has long values at both the K and the <K, String> levels.
> PTable<K, Pair<Long, Map<String, Long>>> in = ...;
> // Define and apply an Aggregator that can handle sums at both levels within a single
MR job.
> Aggregator<Pair<Long, Map<String, Long>>> a = pairAggregator(SUM_LONGS(),
> PTable<K, Pair<Long, Map<String, Long>>> out = in.groupByKey().combineValues(a);
> ...which would run substantially faster than executing two dependent MR jobs, one that
did the city aggregation and then a second follow-up job that did the country aggregation.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:

View raw message