[ https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng updated SPARK-10384:

Description:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses the general implementation
and tracks the progress of subtasks. Univariate statistics include:
continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis
categorical: number of categories, mode
If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g.,
{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}
Note that some univariate statistics depend on others, e.g., variance might depend on mean
and count. It would be nice if SQL could optimize such a sequence to avoid duplicate computation.
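To illustrate the dependency noted above, here is a minimal plain-Python sketch (not Spark code and not the implementation proposed in this JIRA) of one shared aggregation buffer that holds count, mean, and the sum of squared deviations. Sample and population variance and standard deviation can all be read from that single buffer, so nothing is computed twice. The names `Moments`, `update`, `merge`, and `moments_of` are hypothetical; they only mirror the update/merge/evaluate shape a UDAF usually has. The sketch uses Welford's single-pass update and the pairwise merge formula of Chan et al.

```python
import math
from dataclasses import dataclass


@dataclass
class Moments:
    """Hypothetical shared buffer: count, mean, and sum of squared deviations (m2)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's single-pass update for one new value.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "Moments") -> "Moments":
        # Combine two partial buffers, e.g. from two partitions.
        n = self.count + other.count
        if n == 0:
            return Moments()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return Moments(n, mean, m2)

    def variance(self, sample: bool = True) -> float:
        # Sample variance divides by n - 1, population variance by n.
        denom = self.count - 1 if sample else self.count
        return self.m2 / denom if denom > 0 else float("nan")

    def stddev(self, sample: bool = True) -> float:
        return math.sqrt(self.variance(sample))


def moments_of(values):
    """Fold an iterable of numbers into one buffer."""
    buf = Moments()
    for v in values:
        buf.update(v)
    return buf


# Two partial buffers merged into one, as a distributed aggregation would do.
total = moments_of([1.0, 2.0, 3.0]).merge(moments_of([4.0, 5.0]))
sample_var = total.variance(sample=True)
population_var = total.variance(sample=False)
```

Skewness and kurtosis could extend the same buffer with third and fourth central moments; that extension is left out here to keep the sketch short.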
Univariate statistics for continuous variables:
* min
* max
* range (SPARK-10861)
* mean
* sample variance (SPARK-9296)
* population variance (SPARK-9296)
* sample standard deviation (SPARK-6458)
* population standard deviation (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median (SPARK-6761)
* approximate quantiles (SPARK-6761)
Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories
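The two categorical statistics listed above can also share one buffer. The following plain-Python sketch (again not Spark code; the function names are hypothetical) keeps exact per-category counts, merges partial buffers by summing counts, and reads both the mode and the number of categories from the merged result. It assumes the number of distinct categories is small enough to hold in memory.

```python
from collections import Counter


def partial_counts(values):
    """Per-partition buffer: how many times each category occurs."""
    return Counter(values)


def merge_counts(a, b):
    """Combine two partial buffers; Counter addition sums the counts."""
    return a + b


def number_of_categories(counts):
    """Number of distinct categories seen."""
    return len(counts)


def mode(counts):
    """Most frequent category, or None for an empty buffer.

    Which of several tied categories is returned is not specified here.
    """
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Two partial buffers merged, then both statistics read from the result.
counts = merge_counts(partial_counts(["a", "b", "a"]),
                      partial_counts(["b", "c", "b"]))
n_categories = number_of_categories(counts)
most_frequent = mode(counts)
```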
was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses the general implementation
and tracks the progress of subtasks. Univariate statistics include:
continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis
categorical: number of categories, mode
If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g.,
{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}
Note that some univariate statistics depend on others, e.g., variance might depend on mean
and count. It would be nice if SQL could optimize such a sequence to avoid duplicate computation.
Univariate statistics for continuous variables:
* min
* max
* range (SPARK-10861)
* mean
* sample variance (SPARK-9296)
* population variance (SPARK-9296)
* sample standard deviation (SPARK-6458)
* population standard deviation (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median
* approximate quantiles
Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories
> Univariate statistics as UDAFs
> 
>
> Key: SPARK-10384
> URL: https://issues.apache.org/jira/browse/SPARK-10384
> Project: Spark
> Issue Type: Umbrella
> Components: ML, SQL
> Reporter: Xiangrui Meng
> Assignee: Burak Yavuz
>

