spark-issues mailing list archives

From "Xiangrui Meng (JIRA)" <>
Subject [jira] [Created] (SPARK-10384) Univariate statistics as UDAFs
Date Tue, 01 Sep 2015 06:45:46 GMT
Xiangrui Meng created SPARK-10384:

             Summary: Univariate statistics as UDAFs
                 Key: SPARK-10384
             Project: Spark
          Issue Type: Umbrella
          Components: ML, SQL
            Reporter: Xiangrui Meng
            Assignee: Burak Yavuz

It would be nice to define univariate statistics as UDAFs. This JIRA discusses the general implementation
and tracks the progress of subtasks. Univariate statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g.,

df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))

Note that some univariate statistics depend on others, e.g., variance might depend on mean
and count. It would be nice if SQL could optimize the sequence to avoid duplicate computation.
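
To illustrate the dependency noted above, here is a minimal sketch in plain Python (no Spark involved) of one aggregation buffer from which count, mean, variance, and stddev are all derived in a single pass. The names Moments, update, merge, and aggregate are hypothetical and not part of any Spark API. The sketch assumes sample variance (dividing by n - 1), which the issue does not specify. It uses Welford's online update and the pairwise merge formula of Chan et al. as one possible way to combine partial results, not necessarily what Spark would implement.

```python
from dataclasses import dataclass
import math


@dataclass
class Moments:
    """Hypothetical shared buffer: count, mean, and M2
    (the sum of squared deviations from the mean)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's online update for one new value.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "Moments") -> "Moments":
        # Combine two partial buffers, as a distributed aggregate
        # would need to do across partitions.
        if other.n == 0:
            return Moments(self.n, self.mean, self.m2)
        if self.n == 0:
            return Moments(other.n, other.mean, other.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Moments(n, mean, m2)

    # Each statistic below reads the same buffer, so nothing is recomputed.
    def count(self) -> int:
        return self.n

    def variance(self) -> float:
        # Assumption: sample variance; NaN for fewer than two values.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

    def stddev(self) -> float:
        return math.sqrt(self.variance())


def aggregate(values):
    # Fold a sequence of values into one buffer.
    buf = Moments()
    for v in values:
        buf.update(v)
    return buf
```

For example, aggregate(part_a).merge(aggregate(part_b)) yields a buffer whose count(), mean, variance(), and stddev() all come from the same three numbers, which is the kind of sharing an optimizer could exploit when several of these statistics are requested together.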
