flink-issues mailing list archives

From "Todd Lisonbee (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (FLINK-3613) Add standard deviation, mean, variance to list of Aggregations
Date Tue, 22 Mar 2016 19:03:25 GMT

    [ https://issues.apache.org/jira/browse/FLINK-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207054#comment-15207054 ]

Todd Lisonbee edited comment on FLINK-3613 at 3/22/16 7:02 PM:
---------------------------------------------------------------

Attached is a design for improvements to DataSet.aggregate() needed to implement additional
aggregations like Standard Deviation.

To keep the public API unchanged, the best path seems to be to have AggregateOperator implement
CustomUnaryOperation, but that seems odd because no other Operator is built that way.  The
other options I see are not consistent with the other Operators either.

I really could use some feedback on this.  Thanks.

Also, should I be posting this to the Dev mailing list?


> Add standard deviation, mean, variance to list of Aggregations
> --------------------------------------------------------------
>
>                 Key: FLINK-3613
>                 URL: https://issues.apache.org/jira/browse/FLINK-3613
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Todd Lisonbee
>            Priority: Minor
>         Attachments: DataSet-Aggregation-Design-March2016-v1.txt
>
>
> Implement standard deviation, mean, and variance for org.apache.flink.api.java.aggregation.Aggregations.
> Ideally, the implementation should be single-pass and numerically stable.
> References:
> "Scalable and Numerically Stable Descriptive Statistics in SystemML", Tian et al., International
> Conference on Data Engineering 2012
> http://dl.acm.org/citation.cfm?id=2310392
> "The Kahan summation algorithm (also known as compensated summation) reduces the numerical
> errors that occur when adding a sequence of finite precision floating point numbers. Numerical
> errors arise due to truncation and rounding. These errors can lead to numerical instability
> when calculating variance."
> https://en.wikipedia.org/wiki/Kahan_summation_algorithm
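
For illustration only, here is a minimal sketch of what "single pass and numerically
stable" could look like in plain Java. It is not the attached design and not necessarily
the method from the SystemML paper: it uses Welford's online update for mean and variance,
the pairwise merge formula from Chan et al. to combine partial results (as a combiner
across partitions would need), and Kahan summation as quoted above. The names RunningStats,
add, merge and kahanSum are hypothetical and are not part of the Flink API.

```java
/** Hypothetical helper (not part of Flink): single-pass, mergeable mean and variance. */
final class RunningStats {
    private long count;
    private double mean;
    private double m2; // sum of squared deviations from the current mean

    /** Welford's update: folds in one value without storing the data. */
    public void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    /** Merges another partial result into this one (Chan et al. pairwise formula). */
    public void merge(RunningStats other) {
        if (other.count == 0) {
            return;
        }
        if (count == 0) {
            count = other.count;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = count + other.count;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) count * other.count / total);
        mean += delta * other.count / total;
        count = total;
    }

    public long count() { return count; }

    public double mean() { return mean; }

    /** Sample variance (n - 1 denominator); NaN for fewer than two values. */
    public double variance() { return count > 1 ? m2 / (count - 1) : Double.NaN; }

    public double stdDev() { return Math.sqrt(variance()); }

    /** Kahan (compensated) summation of an array of doubles. */
    public static double kahanSum(double[] values) {
        double sum = 0.0;
        double c = 0.0; // compensation for low-order bits lost so far
        for (double v : values) {
            double y = v - c;   // apply the correction to the next value
            double t = sum + y; // low-order bits of y may be lost here
            c = (t - sum) - y;  // recover what was lost
            sum = t;
        }
        return sum;
    }
}
```

Each partition would call add() per record and the partial results would then be merged,
so the data is read once. Whether this fits behind DataSet.aggregate() depends on the
operator design discussed in the comment above.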



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
