incubator-crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <jwi...@cloudera.com>
Subject Re: CombineFn vs. Aggregator
Date Mon, 05 Nov 2012 03:17:45 GMT
I looked at the review board post for this, and it looks great-- much
cleaner than the old way.

I think my only objection was w/marking combineValues(CombineFn) as
deprecated, which struck me as premature. I'd be curious to hear the
reasoning on that.

Josh

On Sun, Nov 4, 2012 at 4:46 AM, Matthias Friedrich <matt@mafr.de> wrote:

> Hi,
>
> having looked at every single CombineFn and Aggregator in Crunch's
> code base and in light of CRUNCH-106 [1], I wonder if we could make
> things even easier to use. How about changing the Aggregator interface
> to this:
>
> public interface Aggregator<T> extends Serializable {
>   void initialize(Configuration conf);
>   Iterable<T> aggregate(Iterable<T> values);
> }
>
> This would make writing anonymous inner classes for combineValues()
> much easier and shorter. Implementations could still keep a buffer
> between calls to aggregate() if they want to optimize memory
> management.
>
> Is there anything I'm overlooking?
>
> Regards,
>   Matthias
>
> [1] https://issues.apache.org/jira/browse/CRUNCH-106
>
> On Friday, 2012-11-02, Matthias Friedrich wrote:
> > Hi,
> >
> > in our last discussion on refactoring we agreed (I think) to clean
> > up CombineFn and move implementations to .fn.CombineFns. While playing
> > with the API, I noticed that it's harder than necessary to get your
> > generics in line. Example:
> >
> >   PGroupedTable<String, Integer> table = /* ... */;
> >   table.combineValues(CombineFn.<String>SUM_INTS());
> >
> > You need to specify the String type or you get a compilation error.
> > This leaks an implementation detail that isn't obvious to users; you
> > need to know how CombineFn works to know that you have to specify the
> > key type. And it's just plain ugly.
> >
> > How about using Aggregator more prominently by adding a
> > combineValues(Aggregator<V> a) method to PGroupedTable? We could
> > simplify the code and make it easier to understand:
> >
> >   PGroupedTable<String, Integer> table = /* ... */;
> >   table.combineValues(Aggregators.SUM_INTS());
> >
> > Writing custom aggregations would be easier, too, as you don't need to
> > decorate your aggregator with AggregatorCombineFn yourself anymore. We
> > could go even further and remove combineValues(CombineFn), because it
> > doesn't add much value anymore. If you need maximum flexibility (that
> > is, your aggregation depends on the table's key somehow), you can
> > still use parallelDo() with CombineFn directly. I don't think that's
> > a common use case though.
> >
> > Here's what I propose:
> >
> >   1) Leave CombineFn and Aggregator in base
> >   2) Remove all static factories (SUM_INTS etc.) from CombineFn
> >   3) Create an Aggregators class in .fn with static factories
> >      for everything removed from CombineFn
> >   4) Make the implementation classes private, they just take up
> >      lots of space in javaodcs
> >   5) Bonus: Remove combineValues(CombineFn)
> >
> > One thing I don't understand: Is there really a need for
> > AggregatorFactory? Am I overlooking something? It's not like we have
> > to create Aggregator instances for each key -- that's what
> > Aggregator.reset() is for.
> >
> > Regards,
> >   Matthias
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message