flink-dev mailing list archives

From Hermann Gábor <reckone...@gmail.com>
Subject Re: Aggregations
Date Tue, 09 Sep 2014 09:42:26 GMT
The only advantage of returning a single value instead of the whole tuple
would be smaller data. I agree that it is not that useful, and the logic
that you proposed earlier could simply provide this with a single

In addition, isn't it possible to provide the mechanism in your proposal
without the user needing to set the return types? Can the types be
extracted from the tuple and the aggregation (e.g. average should be a
Double)?

A simple example of the custom return type reduce function is a modified

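	public class WC {
		public String word;
		public int count;
		// [...]
	}

	// ReduceFunction<String, WC> is the proposed two-type variant:
	// input values are Strings, the reduced value is a WC.
	public class WordCounter implements ReduceFunction<String, WC> {

		public WC reduce(String word, WC reductionValue) {
			return new WC(word, 1 + reductionValue.count);
		}
	}

	// The second argument is the initial value of the reduction.
	groupedWords.reduce(new WordCounter(), new WC(null, 0));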

(Of course this can be easily done with an aggregation, but this was
the simplest use case I could come up with.)
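
To make the intended semantics concrete, here is a minimal sketch in plain
Java with no Flink classes involved. CustomReduceFunction and the static
reduce helper are hypothetical names made up for this illustration, and the
WC constructor is an assumption, since the class body above is elided.

```java
import java.util.Arrays;
import java.util.List;

final class CustomReduceSketch {

    // Hypothetical interface: folds an input of type IN into a value of type T.
    interface CustomReduceFunction<IN, T> {
        T reduce(IN value, T reductionValue);
    }

    static final class WC {
        final String word;
        final int count;

        // Assumed constructor; the original WC body is elided.
        WC(String word, int count) {
            this.word = word;
            this.count = count;
        }
    }

    // Applies f to every value in order, starting from the initial value.
    static <IN, T> T reduce(Iterable<IN> values, CustomReduceFunction<IN, T> f, T initial) {
        T result = initial;
        for (IN value : values) {
            result = f.reduce(value, result);
        }
        return result;
    }

    public static void main(String[] args) {
        // One group of identical words, as a grouped data set would deliver them.
        List<String> group = Arrays.asList("flink", "flink", "flink");
        WC wc = reduce(group, (word, acc) -> new WC(word, 1 + acc.count), new WC(null, 0));
        System.out.println(wc.word + " " + wc.count);
    }
}
```

The helper is strictly sequential: each step needs the result of the
previous one, which is where the drawbacks discussed next come from.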

The only advantage here is also the smaller/clearer value and maybe
Functional languages like Haskell support this kind of reduction on
(that is the reason I thought about this). On the other hand, there are
many drawbacks to a reduce function like this: it cannot combine two
separately reduced sets of data, the user must provide an initial value,
and every reduction like this can also be done with larger tuples. It is
not clear to me whether it would be better or not, but I thought it was
worth considering.
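
The first drawback can be illustrated with a small plain-Java sketch. All
names here (CombineSketch, fold, countDistributed) are hypothetical, and the
two lists only stand in for two partitions of a data set. The step function
takes a String and a count, so it cannot be applied to two partial counts;
a separate merge function is needed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

final class CombineSketch {

    // Sequential reduction with a custom result type T and an initial value.
    static <IN, T> T fold(List<IN> partition, T initial, BiFunction<IN, T, T> step) {
        T acc = initial;
        for (IN value : partition) {
            acc = step.apply(value, acc);
        }
        return acc;
    }

    // Counts the words of two partitions that were reduced separately.
    static int countDistributed(List<String> p1, List<String> p2) {
        BiFunction<String, Integer, Integer> step = (word, count) -> count + 1;
        int c1 = fold(p1, 0, step);
        int c2 = fold(p2, 0, step);
        // step.apply(c1, c2) would not compile: its first argument must be a
        // String, not a partial result. Combining needs a second function.
        BinaryOperator<Integer> merge = Integer::sum;
        return merge.apply(c1, c2);
    }

    public static void main(String[] args) {
        List<String> p1 = Arrays.asList("a", "a");
        List<String> p2 = Arrays.asList("a", "a", "a");
        System.out.println(countDistributed(p1, p2));
    }
}
```

A reduce function of the usual shape (two values of the same type in, one
out) does not have this problem, because the same function serves as both
the step and the merge.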


On Tue, Sep 9, 2014 at 12:30 AM, Fabian Hueske <fhueske@apache.org> wrote:

> Having aggregation functions that return only a single value is not very
> helpful IMO.
> First, an aggregation function should also work on grouped data sets, i.e.,
> return one aggregate for each group. Hence, the grouping keys must be
> included in the result somehow.
> Second, imagine a use case where the min, max, and avg value of some fields
> of a tuple are needed. If these were computed with multiple independent
> aggregation functions, the data set would be shuffled and reduced three
> times and possibly joined again.
> I think it should be possible to combine multiple aggregation functions,
> e.g., compute a result with field 2 as grouping key, the minimum and
> maximum of field 3 and the average of field 5.
> Basically, have something like the project operator but with aggregation
> functions and keys. This is also what I sketched in my proposal.
> @Hermann: Regarding the reduce function with custom return type, do you
> have some concrete use case in mind for that?
> Cheers, Fabian
> 2014-09-08 14:20 GMT+02:00 Hermann Gábor <reckoner42@gmail.com>:
> > I also agree on using the minBy as the default mechanism.
> >
> > If both min and minBy are needed, it would seem more natural for min (and
> > also for sum) to return only the given field of the tuple in my opinion.
> >
> > More generally a reduce function with a custom return type would also be
> > useful in my view. In that case the user would also give a value of type T
> > to begin the reduction with, and implement a function which reduces a value
> > and a value of type T and returns a value of type T. Would that make sense?
> >
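
The combined aggregation described in the quoted message (a grouping key
plus min, max and avg computed together) can be sketched in plain Java as
follows. This is only an illustration of the single-pass idea, not the
proposed operator: Row, Stats and aggregate are hypothetical names, and a
HashMap stands in for the grouping that the system would do.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CombinedAggregationSketch {

    // Hypothetical record: a grouping key and two numeric fields.
    static final class Row {
        final String key;
        final double a; // field to take the minimum and maximum of
        final double b; // field to average

        Row(String key, double a, double b) {
            this.key = key;
            this.a = a;
            this.b = b;
        }
    }

    // Running aggregates for one group.
    static final class Stats {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        long count = 0;

        double avg() {
            return sum / count;
        }
    }

    // One pass over the data: each row updates all aggregates of its group.
    static Map<String, Stats> aggregate(List<Row> rows) {
        Map<String, Stats> result = new HashMap<>();
        for (Row r : rows) {
            Stats s = result.computeIfAbsent(r.key, k -> new Stats());
            s.min = Math.min(s.min, r.a);
            s.max = Math.max(s.max, r.a);
            s.sum += r.b;
            s.count++;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("x", 3, 10), new Row("x", 1, 20), new Row("y", 7, 5));
        for (Map.Entry<String, Stats> e : aggregate(rows).entrySet()) {
            Stats s = e.getValue();
            System.out.println(e.getKey() + " " + s.min + " " + s.max + " " + s.avg());
        }
    }
}
```

Note that the result keeps the grouping key next to the aggregates, and
that Stats objects of two partial results could be merged field by field.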
