incubator-crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matthias Friedrich <m...@mafr.de>
Subject Re: CombineFn vs. Aggregator
Date Sun, 04 Nov 2012 12:46:40 GMT
Hi,

having looked at every single CombineFn and Aggregator in Crunch's
code base and in light of CRUNCH-106 [1], I wonder if we could make
things even easier to use. How about changing the Aggregator interface
to this:

public interface Aggregator<T> extends Serializable {
  void initialize(Configuration conf);
  Iterable<T> aggregate(Iterable<T> values);
}

This would make writing anonymous inner classes for combineValues()
much easier and shorter. Implementations could still keep a buffer
between calls to aggregate() if they want to optimize memory
management.

Is there anything I'm overlooking?

Regards,
  Matthias

[1] https://issues.apache.org/jira/browse/CRUNCH-106

On Friday, 2012-11-02, Matthias Friedrich wrote:
> Hi,
> 
> in our last discussion on refactoring we agreed (I think) to clean
> up CombineFn and move implementations to .fn.CombineFns. While playing
> with the API, I noticed that it's harder than necessary to get your
> generics in line. Example:
> 
>   PGroupedTable<String, Integer> table = /* ... */;
>   table.combineValues(CombineFn.<String>SUM_INTS());
> 
> You need to specify the String type or you get a compilation error.
> This leaks an implementation detail that isn't obvious to users; you
> need to know how CombineFn works to know that you have to specify the
> key type. And it's just plain ugly.
> 
> How about using Aggregator more prominently by adding a
> combineValues(Aggregator<V> a) method to PGroupedTable? We could
> simplify the code and make it easier to understand:
> 
>   PGroupedTable<String, Integer> table = /* ... */;
>   table.combineValues(Aggregators.SUM_INTS());
> 
> Writing custom aggregations would be easier, too, as you don't need to
> decorate your aggregator with AggregatorCombineFn yourself anymore. We
> could go even further and remove combineValues(CombineFn), because it
> doesn't add much value anymore. If you need maximum flexibility (that
> is, your aggregation depends on the table's key somehow), you can
> still use parallelDo() with CombineFn directly. I don't think that's
> a common use case though.
> 
> Here's what I propose:
> 
>   1) Leave CombineFn and Aggregator in base
>   2) Remove all static factories (SUM_INTS etc.) from CombineFn
>   3) Create an Aggregators class in .fn with static factories
>      for everything removed from CombineFn
>   4) Make the implementation classes private, they just take up
>      lots of space in javaodcs
>   5) Bonus: Remove combineValues(CombineFn)
> 
> One thing I don't understand: Is there really a need for
> AggregatorFactory? Am I overlooking something? It's not like we have
> to create Aggregator instances for each key -- that's what
> Aggregator.reset() is for.
> 
> Regards,
>   Matthias

Mime
View raw message