flink-issues mailing list archives

From "Fabian Hueske (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-3613) Add standard deviation, mean, variance to list of Aggregations
Date Fri, 18 Mar 2016 09:29:33 GMT

    [ https://issues.apache.org/jira/browse/FLINK-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201239#comment-15201239 ]

Fabian Hueske commented on FLINK-3613:
--------------------------------------

Hi [~tlisonbee], welcome to the Flink community!

Adding more aggregation functions is a very good place to start, IMO.
As you observed, the current {{AggregationFunction}} interface is very basic and quite limited.
Since it is marked as an {{@Internal}} API, it can be changed without worrying about backwards
compatibility.

How about you design a new interface and sketch a brief design doc?

Btw. FLINK-2144 is about the streaming API whereas this issue addresses the DataSet batch
API. I'm not aware of another JIRA that proposes to add more aggregation functions to the
DataSet API or anybody working on this.

Thanks, Fabian

> Add standard deviation, mean, variance to list of Aggregations
> --------------------------------------------------------------
>
>                 Key: FLINK-3613
>                 URL: https://issues.apache.org/jira/browse/FLINK-3613
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Todd Lisonbee
>            Priority: Minor
>
> Implement standard deviation, mean, variance for org.apache.flink.api.java.aggregation.Aggregations
> Ideally implementation should be single pass and numerically stable.
> References:
> "Scalable and Numerically Stable Descriptive Statistics in SystemML", Tian et al., International
> Conference on Data Engineering 2012
> http://dl.acm.org/citation.cfm?id=2310392
> "The Kahan summation algorithm (also known as compensated summation) reduces the numerical
> errors that occur when adding a sequence of finite precision floating point numbers. Numerical
> errors arise due to truncation and rounding. These errors can lead to numerical instability
> when calculating variance."
> https://en.wikipedia.org/wiki/Kahan_summation_algorithm
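
To make the "single pass and numerically stable" requirement from the description concrete, here is a minimal sketch using Welford's online update plus a pairwise merge step for combining partial results from parallel partitions. The class name RunningStats and all of its methods are hypothetical: this is not Flink's AggregationFunction API, and it is not necessarily the method used in the SystemML paper cited above.

```java
// Hypothetical helper for illustration only; not part of Flink.
final class RunningStats {
    private long count;   // number of values seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    // Welford's update: one pass, no subtraction of two large sums.
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    // Combine two partial results (pairwise update), so that each
    // parallel partition can aggregate locally before a final merge.
    void merge(RunningStats other) {
        if (other.count == 0) {
            return;
        }
        if (count == 0) {
            count = other.count;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = count + other.count;
        double delta = other.mean - mean;
        mean += delta * other.count / total;
        m2 += other.m2 + delta * delta * ((double) count * other.count / total);
        count = total;
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    // Sample variance (divides by n - 1); NaN for fewer than two values.
    double variance() {
        return count > 1 ? m2 / (count - 1) : Double.NaN;
    }

    double stdDev() {
        return Math.sqrt(variance());
    }
}
```

The merge step matters for a batch API because each partition would aggregate locally and the partial results would be combined afterwards. Whether to expose the sample or the population variance is a design choice this sketch leaves open; it returns the sample variance.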



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
