incubator-crunch-dev mailing list archives

From "Josh Wills (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-133) Add Aggregator support for combineValues ops on secondary keys via maps and collections
Date Sun, 23 Dec 2012 20:58:13 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539093#comment-13539093 ]

Josh Wills commented on CRUNCH-133:
-----------------------------------

Gabriel, thanks for the review. I agree with the issues you point out, and that the complexity
this patch introduces isn't clearly worth the benefit.

We could go back to returning AggregatorFactory instances instead of Aggregator instances
from the factory methods, but again, that imposes a cognitive cost that may not be worthwhile
for the one use case. It would seem simpler to me to limit the collections() and maps() aggregator
methods to taking in AggregatorFactory instances and to leave everything else (i.e., the general
case) alone. This is one of those times where Java's lack of first-class functions is a real
pain. :)
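
To make the factory argument concrete, here is a rough sketch of why a map-level aggregator
wants a factory rather than a single instance: each distinct map key needs its own fresh
aggregator state. Everything in it is hypothetical and for illustration only.
SimpleAggregator, sumLongs() and maps() are simplified stand-ins, not Crunch's actual
Aggregator or AggregatorFactory API, and java.util.function.Supplier (Java 8, which postdates
this thread) plays the role of the factory.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class MapAggregatorSketch {

  /** Hypothetical, simplified stateful aggregator; NOT Crunch's Aggregator interface. */
  interface SimpleAggregator<T> {
    void update(T value);
    T result();
  }

  /** Returns a new stateful sum; each instance holds its own running total. */
  static SimpleAggregator<Long> sumLongs() {
    return new SimpleAggregator<Long>() {
      private long sum = 0L;
      @Override public void update(Long value) { sum += value; }
      @Override public Long result() { return sum; }
    };
  }

  /**
   * Hypothetical maps() that takes a factory instead of an instance, so that
   * every distinct map key gets its own independent aggregator.
   */
  static <K, V> SimpleAggregator<Map<K, V>> maps(Supplier<SimpleAggregator<V>> factory) {
    return new SimpleAggregator<Map<K, V>>() {
      private final Map<K, SimpleAggregator<V>> perKey = new HashMap<>();

      @Override public void update(Map<K, V> value) {
        for (Map.Entry<K, V> e : value.entrySet()) {
          // Create the per-key aggregator lazily, the first time the key is seen.
          perKey.computeIfAbsent(e.getKey(), k -> factory.get()).update(e.getValue());
        }
      }

      @Override public Map<K, V> result() {
        Map<K, V> out = new HashMap<>();
        for (Map.Entry<K, SimpleAggregator<V>> e : perKey.entrySet()) {
          out.put(e.getKey(), e.getValue().result());
        }
        return out;
      }
    };
  }

  public static void main(String[] args) {
    // The method reference acts as the factory.
    SimpleAggregator<Map<String, Long>> agg = maps(MapAggregatorSketch::sumLongs);

    Map<String, Long> first = new HashMap<>();
    first.put("NYC", 3L);
    first.put("SF", 2L);
    Map<String, Long> second = new HashMap<>();
    second.put("NYC", 1L);

    agg.update(first);
    agg.update(second);
    System.out.println(agg.result().get("NYC")); // prints 4
    System.out.println(agg.result().get("SF"));  // prints 2
  }
}
```

Passing one shared sumLongs() instance instead of a factory would mix the totals of all keys
together, which is the reason the factory form is attractive for maps() and collections() only.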
                
> Add Aggregator support for combineValues ops on secondary keys via maps and collections
> ---------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-133
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-133
>             Project: Crunch
>          Issue Type: New Feature
>            Reporter: Josh Wills
>         Attachments: CRUNCH-133.patch
>
>
> Sawzall has a neat trick where you can do aggregations on secondary keys via maps, which
> is useful in cases where you might want to aggregate some data at (for example) both a country
> and at a city level within a single MapReduce job. We had a thread on crunch-user about this
> pattern:
> http://mail-archives.apache.org/mod_mbox/incubator-crunch-user/201212.mbox/%3CCAH29n6O-aHXTPHCRpSuAkAGUjvDR%3D56%3D-OLq9K9mZje%2BwVB4-Q%40mail.gmail.com%3E
> The pattern ends up looking something like this:
>
>   // Define a table that has long values at both the K and the <K, String> levels.
>   PTable<K, Pair<Long, Map<String, Long>>> in = ...;
>   // Define and apply an Aggregator that can handle sums at both levels within a single MR job.
>   Aggregator<Pair<Long, Map<String, Long>>> a = pairAggregator(SUM_LONGS(), map(Aggregators.SUM_LONGS()));
>   PTable<K, Pair<Long, Map<String, Long>>> out = in.groupByKey().combineValues(a);
>
> ...which would run substantially faster than executing two dependent MR jobs, one that
> did the city aggregation and then a second follow-up job that did the country aggregation.
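
As a rough, in-memory illustration of the two-level sum described in the issue (plain Java
collections only, not Crunch and not a MapReduce job), the sketch below updates both the
country total and the per-city totals in a single pass. The class TwoLevelSum, the Totals
holder, and the {country, city, count} row layout are all assumptions made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

class TwoLevelSum {

  /** Hypothetical holder: an overall sum for a primary key plus sums per secondary key. */
  static final class Totals {
    long total = 0L;
    final Map<String, Long> bySecondary = new HashMap<>();
  }

  /** Each row is {country, city, count}; one pass updates both aggregation levels. */
  static Map<String, Totals> aggregate(String[][] rows) {
    Map<String, Totals> out = new HashMap<>();
    for (String[] row : rows) {
      long n = Long.parseLong(row[2]);
      Totals t = out.computeIfAbsent(row[0], k -> new Totals());
      t.total += n;                              // country-level sum
      t.bySecondary.merge(row[1], n, Long::sum); // city-level sum
    }
    return out;
  }

  public static void main(String[] args) {
    String[][] rows = {
      {"US", "NYC", "3"}, {"US", "SF", "2"}, {"FR", "Paris", "4"}, {"US", "NYC", "1"}
    };
    Map<String, Totals> out = aggregate(rows);
    System.out.println(out.get("US").total);                  // prints 6
    System.out.println(out.get("US").bySecondary.get("NYC")); // prints 4
  }
}
```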

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira