hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "David Phillips (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-165) var(col) built-in to go with avg(col) and count(col)
Date Sat, 13 Dec 2008 01:02:44 GMT

    [ https://issues.apache.org/jira/browse/HIVE-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656219#action_12656219

David Phillips commented on HIVE-165:

Rather than add a non-standard VAR function, we should add these ANSI SQL 2003 functions:


I recommend against adding the non-standard STDDEV and VARIANCE functions as their behavior
differs between MySQL (population) and Oracle/PostgreSQL (sample).

For the time being, you can use this workaround:

VAR_POP = (SUM(x * x) - (SUM(x) * AVG(x))) / COUNT(x)
VAR_SAMP = (SUM(x * x) - (SUM(x) * AVG(x))) / (COUNT(x) - 1)

> var(col) built-in to go with avg(col) and count(col)
> ----------------------------------------------------
>                 Key: HIVE-165
>                 URL: https://issues.apache.org/jira/browse/HIVE-165
>             Project: Hadoop Hive
>          Issue Type: Wish
>            Reporter: Adam Kramer
>            Priority: Minor
> The last step in the unholy triumvirate of statistical built-ins is the variance. We
already have the n (count) and the mean (avg). I currently have a job or two that filters
all of the data into a single reducer which just computes mean/n/variance and writes it to
a table...so my guess is that this would be a pretty big speed increase. Not a huge deal though,
as computing the variance myself is trivial.
> (Average, variance, and n can be co-computed in one pass, so if you're doing var() you
can basically have avg() and count() for free.)

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message