spark-issues mailing list archives

From "Xiangrui Meng (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs
Date Mon, 05 Oct 2015 22:22:26 GMT

     [ https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-10384:
----------------------------------
    Description: 
It would be nice to define univariate statistics as UDAFs. This JIRA discusses general implementation
and tracks the progress of subtasks. Univariate statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might depend on mean
and count. It would be nice if SQL can optimize the sequence to avoid duplicate computation.
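
To illustrate the dependency noted above, here is a minimal sketch in plain Python (not Spark code) of one shared aggregation buffer from which variance, skewness, and kurtosis can all be read without rescanning the data. The names `Moments`, `aggregate`, `variance`, `skewness`, and `kurtosis` are hypothetical, and the formulas assume the standard pairwise-merge updates for central moment sums; the definitions used here (population skewness, excess kurtosis) are one possible choice and are not taken from this JIRA.

```python
import math
from dataclasses import dataclass


@dataclass
class Moments:
    """Shared buffer: count, mean, and central moment sums M2..M4."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0
    m3: float = 0.0
    m4: float = 0.0

    def merge(self, other: "Moments") -> "Moments":
        """Combine two partial buffers (e.g. from two partitions)."""
        if other.n == 0:
            return self
        if self.n == 0:
            return other
        na, nb = self.n, other.n
        n = na + nb
        delta = other.mean - self.mean
        mean = self.mean + delta * nb / n
        m2 = self.m2 + other.m2 + delta**2 * na * nb / n
        m3 = (self.m3 + other.m3
              + delta**3 * na * nb * (na - nb) / n**2
              + 3 * delta * (na * other.m2 - nb * self.m2) / n)
        m4 = (self.m4 + other.m4
              + delta**4 * na * nb * (na**2 - na * nb + nb**2) / n**3
              + 6 * delta**2 * (na**2 * other.m2 + nb**2 * self.m2) / n**2
              + 4 * delta * (na * other.m3 - nb * self.m3) / n)
        return Moments(n, mean, m2, m3, m4)

    def update(self, x: float) -> "Moments":
        """Add one value by merging a single-element buffer."""
        return self.merge(Moments(1, float(x)))


def aggregate(values) -> Moments:
    """Fold a sequence of numbers into one buffer in a single pass."""
    acc = Moments()
    for v in values:
        acc = acc.update(v)
    return acc


def variance(m: Moments, sample: bool = True) -> float:
    # Sample variance divides by n - 1, population variance by n.
    return m.m2 / (m.n - 1 if sample else m.n)


def skewness(m: Moments) -> float:
    # Population skewness (an assumed definition).
    return math.sqrt(m.n) * m.m3 / m.m2**1.5


def kurtosis(m: Moments) -> float:
    # Excess kurtosis (an assumed definition).
    return m.n * m.m4 / m.m2**2 - 3.0
```

Because every statistic is derived from the same buffer, a query that asks for several of them on one column would only need to maintain that buffer once per group, which is the kind of sharing the note above hopes the optimizer could exploit.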

Univariate statistics for continuous variables:
* -min-
* -max-
* range (SPARK-10861)
* -mean-
* sample variance (SPARK-9296)
* population variance (SPARK-9296)
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median (SPARK-6761)
* approximate quantiles (SPARK-6761)

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories
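
For the two categorical statistics listed above, a similar sketch in plain Python (again not Spark code) keeps an exact count per distinct value and merges partial counts across partitions. The helper names `count_values`, `merge_counts`, `num_categories`, and `mode` are hypothetical. Note the assumption that the set of distinct values fits in memory; a high-cardinality column would likely need an approximate structure instead.

```python
from collections import Counter


def count_values(values) -> Counter:
    """Build a partial buffer: one exact count per distinct value."""
    return Counter(values)


def merge_counts(a: Counter, b: Counter) -> Counter:
    """Combine two partial buffers by adding their counts."""
    return a + b


def num_categories(counts: Counter) -> int:
    """Number of distinct values seen."""
    return len(counts)


def mode(counts: Counter):
    """Most frequent value; ties are broken by Counter's own ordering."""
    return counts.most_common(1)[0][0]
```

Both statistics come from the same buffer, so computing them together costs no more than computing either one alone.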

  was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses general implementation
and tracks the progress of subtasks. Univariate statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might depend on mean
and count. It would be nice if SQL can optimize the sequence to avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range (SPARK-10861)
* -mean-
* sample variance (SPARK-9296)
* population variance (SPARK-9296)
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories


> Univariate statistics as UDAFs
> ------------------------------
>
>                 Key: SPARK-10384
>                 URL: https://issues.apache.org/jira/browse/SPARK-10384
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, SQL
>            Reporter: Xiangrui Meng
>            Assignee: Burak Yavuz
>
> It would be nice to define univariate statistics as UDAFs. This JIRA discusses general implementation and tracks the progress of subtasks. Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might depend on mean and count. It would be nice if SQL can optimize the sequence to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * -min-
> * -max-
> * range (SPARK-10861)
> * -mean-
> * sample variance (SPARK-9296)
> * population variance (SPARK-9296)
> * -sample standard deviation- (SPARK-6458)
> * -population standard deviation- (SPARK-6458)
> * skewness (SPARK-10641)
> * kurtosis (SPARK-10641)
> * approximate median (SPARK-6761)
> * approximate quantiles (SPARK-6761)
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics)
> * number of categories



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

