flink-dev mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: Aggregations
Date Tue, 09 Sep 2014 09:05:13 GMT
Let's come up with a comprehensive design that works for both Batch and
Streaming API.

It would be good to include aggregation functions that internally break
down into multiple aggregations (like AVG breaking down into a count and a
sum), while ensuring that no aggregate is computed twice unnecessarily.
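
As a rough illustration of that idea, here is a minimal sketch in plain Python, not Flink API. All names (DEPENDENCIES, PRIMITIVES, FINALIZERS, plan, aggregate) are hypothetical. It assumes each requested aggregate declares the primitive aggregates it needs, so that a request for AVG and SUM together computes the sum only once.

```python
# Hypothetical sketch, not Flink API: every name here is made up for illustration.

# Each requested aggregate lists the primitive aggregates it depends on.
DEPENDENCIES = {
    "sum": ["sum"],
    "count": ["count"],
    "min": ["min"],
    "max": ["max"],
    "avg": ["sum", "count"],  # AVG is derived from SUM and COUNT
}

# Primitive aggregates: (initial state, update function).
PRIMITIVES = {
    "sum": (0, lambda acc, x: acc + x),
    "count": (0, lambda acc, x: acc + 1),
    "min": (None, lambda acc, x: x if acc is None else min(acc, x)),
    "max": (None, lambda acc, x: x if acc is None else max(acc, x)),
}

# Derived aggregates are finalized from the shared primitive results.
FINALIZERS = {
    "avg": lambda p: p["sum"] / p["count"] if p["count"] else None,
}


def plan(requested):
    """Return the deduplicated primitives needed, in first-seen order."""
    needed = []
    for name in requested:
        for primitive in DEPENDENCIES[name]:
            if primitive not in needed:
                needed.append(primitive)
    return needed


def aggregate(values, requested):
    """Compute all requested aggregates in a single pass over values."""
    needed = plan(requested)
    state = {p: PRIMITIVES[p][0] for p in needed}
    for x in values:
        for p in needed:
            state[p] = PRIMITIVES[p][1](state[p], x)
    return {
        name: FINALIZERS[name](state) if name in FINALIZERS else state[name]
        for name in requested
    }
```

With this sketch, plan(["avg", "sum", "count"]) yields just ["sum", "count"], so asking for the average on top of the sum and count adds no extra primitive.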

On Tue, Sep 9, 2014 at 12:30 AM, Fabian Hueske <fhueske@apache.org> wrote:

> Aggregation functions that return only a single value are not very
> helpful, IMO.
> First, an aggregation function should also work on grouped data sets, i.e.,
> return one aggregate for each group. Hence, the grouping keys must be
> included in the result somehow.
> Second, imagine a use case where the min, max, and avg value of some fields
> of a tuple are needed. If these were computed with multiple independent
> aggregation functions, the data set would be shuffled and reduced three
> times and possibly joined again.
> I think it should be possible to combine multiple aggregation functions,
> e.g., compute a result with field 2 as grouping key, the minimum and
> maximum of field 3 and the average of field 5.
> Basically, have something like the project operator but with aggregation
> functions and keys. This is also what I sketched in my proposal.
> @Hermann: Regarding the reduce function with custom return type, do you
> have some concrete use case in mind for that?
> Cheers, Fabian
> 2014-09-08 14:20 GMT+02:00 Hermann Gábor <reckoner42@gmail.com>:
> > I also agree on using the minBy as the default mechanism.
> >
> > If both min and minBy are needed, it would seem more natural for min (and
> > also for sum) to return only the given field of the tuple in my opinion.
> >
> > More generally a reduce function with a custom return type would also be
> > useful in my view. In that case the user would also give a value of type T
> > to begin the reduction with, and implement a function which reduces a value
> > and a value of type T and returns a value of type T. Would that make sense?
> >
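
The reduce with a custom return type that Hermann describes above is essentially a fold. A minimal sketch in plain Python, not Flink API, might look as follows. The name fold and the argument order of the step function (accumulator first, then the input value) are assumptions, since the mail does not specify them.

```python
from functools import reduce
from typing import Callable, Iterable, TypeVar

V = TypeVar("V")  # type of the input values
T = TypeVar("T")  # type of the accumulator and of the result


def fold(values: Iterable[V], initial: T, step: Callable[[T, V], T]) -> T:
    """Reduce values to a T, starting from initial.

    step takes the current accumulator (type T) and one input value
    (type V) and returns the next accumulator (type T).
    """
    return reduce(step, values, initial)


# Example: the result type (a pair) differs from the input type (records).
# The accumulator holds (smallest value of field 1 so far, record count).
records = [("a", 3), ("b", 1), ("c", 2)]
lowest_and_count = fold(
    records,
    (None, 0),
    lambda acc, rec: (
        rec[1] if acc[0] is None else min(acc[0], rec[1]),
        acc[1] + 1,
    ),
)
```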

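Fabian's quoted example (field 2 as grouping key, the minimum and maximum of field 3, and the average of field 5) could be sketched as a single pass that keeps one state per group and per requested aggregate. This is plain Python rather than Flink API. The function aggregate_grouped, its helpers, and the (kind, field) spec format are hypothetical, and field positions are assumed to be 0-based.

```python
# Hypothetical sketch, not Flink API: a project-like operator that takes a
# grouping key plus a list of (aggregation kind, field index) pairs.

def _init(kind):
    # AVG keeps a running (sum, count) pair; MIN and MAX start empty.
    return (0, 0) if kind == "avg" else None


def _update(kind, acc, x):
    if kind == "avg":
        return (acc[0] + x, acc[1] + 1)
    if kind == "min":
        return x if acc is None else min(acc, x)
    if kind == "max":
        return x if acc is None else max(acc, x)
    raise ValueError("unknown aggregation kind: " + kind)


def _finish(kind, acc):
    return acc[0] / acc[1] if kind == "avg" else acc


def aggregate_grouped(rows, key_field, specs):
    """Return one tuple per group: (key, aggregate 1, aggregate 2, ...)."""
    groups = {}
    for row in rows:
        key = row[key_field]
        if key not in groups:
            groups[key] = [_init(kind) for kind, _ in specs]
        state = groups[key]
        for i, (kind, field) in enumerate(specs):
            state[i] = _update(kind, state[i], row[field])
    return [
        (key, *[_finish(kind, acc) for (kind, _), acc in zip(specs, state)])
        for key, state in groups.items()
    ]


rows = [
    (0, 0, "x", 4, 0, 10.0),
    (0, 0, "y", 7, 0, 1.0),
    (0, 0, "x", 9, 0, 20.0),
]
# Group by field 2; min and max of field 3; average of field 5.
result = aggregate_grouped(rows, 2, [("min", 3), ("max", 3), ("avg", 5)])
```

Because the grouping key is part of every output tuple, the result stays usable on grouped data sets, and the input is scanned only once instead of once per aggregate.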