flink-issues mailing list archives

From "Todd Lisonbee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-3664) Create a method to easily Summarize a DataSet
Date Thu, 24 Mar 2016 15:31:25 GMT

    [ https://issues.apache.org/jira/browse/FLINK-3664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210409#comment-15210409 ]

Todd Lisonbee commented on FLINK-3664:

Hi Fabian, thanks for the feedback.

Your first 3 comments all make sense - agreed.

On distinct counts, I thought about it but wasn't sure, so I left it out for now. For an approximation,
the best idea I had was to choose some arbitrary cap, maybe 100, and then report
the exact number of distinct values if there are fewer than 100, or "100+" if there are more.
This would be nice for categorical variables that happen to have fewer than 100 different
values. But with enough rows and columns it could be expensive (even though Tuple is currently
limited to 22 fields), or at least relatively more expensive than the other calculations. There isn't
a perfect magic number, so I wasn't fully happy with this idea.
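
To make the capped idea concrete, here is a minimal sketch, not a proposal for the actual implementation. The class name CappedDistinctCounter and its methods are hypothetical and not part of Flink. It also assumes that a column with exactly `cap` distinct values still reports the exact number, which the description above leaves open.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical helper (not part of Flink): counts distinct values exactly
 * until a cap is exceeded, then stops tracking and reports "<cap>+".
 */
public class CappedDistinctCounter<T> {
    private final int cap;
    private final Set<T> seen = new HashSet<>();
    private boolean overflowed = false;

    public CappedDistinctCounter(int cap) {
        this.cap = cap;
    }

    /** Record one value; once the cap is exceeded, stop storing values. */
    public void add(T value) {
        if (overflowed) {
            return;
        }
        seen.add(value);
        if (seen.size() > cap) {
            overflowed = true;
            seen.clear(); // the exact count is no longer needed, so free the memory
        }
    }

    /** Exact count while within the cap, otherwise "<cap>+". */
    public String result() {
        return overflowed ? cap + "+" : String.valueOf(seen.size());
    }
}
```

Each column would hold at most cap + 1 values at a time, so the extra memory is bounded, but every row still pays for a hash lookup until the cap is exceeded. That is the "relatively more expensive" part.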

Do you know of a nice way to approximate distinct counts?


> Create a method to easily Summarize a DataSet
> ---------------------------------------------
>                 Key: FLINK-3664
>                 URL: https://issues.apache.org/jira/browse/FLINK-3664
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Todd Lisonbee
>         Attachments: DataSet-Summary-Design-March2016-v1.txt
> Here is an example:
> {code}
> /**
>  * Summarize a DataSet of Tuples by collecting single pass statistics for all columns
>  */
> public Tuple summarize()
>
> DataSet<Tuple3<Double, String, Boolean>> input = // [...]
> Tuple3<DoubleColumnSummary,StringColumnSummary,BooleanColumnSummary> summary = input.summarize()
>
> summary.getField(0).stddev()
> summary.getField(1).maxStringLength()
> {code}

