hive-dev mailing list archives

From "Adam Kramer (JIRA)" <>
Subject [jira] Commented: (HIVE-165) var(col) built-in to go with avg(col) and count(col)
Date Sat, 13 Dec 2008 01:16:46 GMT


Adam Kramer commented on HIVE-165:

I agree, and have been annoyed by the inconsistency between POP and SAMP versions of var().

I have used a similar workaround, but since it (currently) takes two mapreduce steps, it's
faster to write my own single-reducer script. If I were using huge data sets, the above would
be faster, though I do worry a bit that for really huge data sets, SUM(x*x) might overflow.
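
To illustrate the overflow worry, here is a minimal Python sketch of a single-pass variance that never accumulates SUM(x*x). It uses Welford's online update, which is an assumption about how one might write such a script and not what Hive does; the names welford, var_pop and var_samp are hypothetical. It also makes the POP versus SAMP difference explicit: the two differ only in the divisor.

```python
def welford(values):
    """Single pass over values; returns (n, mean, m2).

    m2 is the running sum of squared deviations from the current mean,
    so no raw sum of x*x is ever stored.
    """
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in values:
        n += 1
        delta = x - mean          # deviation from the old mean
        mean += delta / n         # update the mean
        m2 += delta * (x - mean)  # uses old and new mean
    return n, mean, m2


def var_pop(values):
    """Population variance: divide by n (hypothetical helper)."""
    n, _, m2 = welford(values)
    return m2 / n if n > 0 else None


def var_samp(values):
    """Sample variance: divide by n - 1 (hypothetical helper)."""
    n, _, m2 = welford(values)
    return m2 / (n - 1) if n > 1 else None
```

For the data 2, 4, 4, 4, 5, 5, 7, 9 the population variance is 4 and the sample variance is 32/7, which shows why mixing the two versions gives inconsistent answers on small inputs.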

> var(col) built-in to go with avg(col) and count(col)
> ----------------------------------------------------
>                 Key: HIVE-165
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: Wish
>            Reporter: Adam Kramer
>            Assignee: David Phillips
>            Priority: Minor
> The last step in the unholy triumvirate of statistical built-ins is the variance. We
> already have the n (count) and the mean (avg). I currently have a job or two that filters
> all of the data into a single reducer which just computes mean/n/variance and writes it to
> a my guess is that this would be a pretty big speed increase. Not a huge deal though,
> as computing the variance myself is trivial.
> (Average, variance, and n can be co-computed in one pass, so if you're doing var() you
> can basically have avg() and count() for free.)
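
The co-computation point in the description can be sketched in Python as a single mergeable state (n, mean, m2), from which count, average and variance all fall out. This is only an illustration under stated assumptions: the function names are hypothetical, the merge step uses the standard pairwise formula for combining partial variances, and nothing here reflects how a Hive built-in would actually be implemented.

```python
from functools import reduce

EMPTY = (0, 0.0, 0.0)  # (n, mean, m2) for no rows


def merge(a, b):
    """Combine two partial states, e.g. from two mappers."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return EMPTY
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def partial_state(chunk):
    """One pass over a chunk: fold each row in as a one-element state."""
    return reduce(merge, ((1, float(x), 0.0) for x in chunk), EMPTY)


def summarize(state):
    """Derive count, avg and population variance from one state."""
    n, mean, m2 = state
    if n == 0:
        return {"count": 0, "avg": None, "var_pop": None}
    return {"count": n, "avg": mean, "var_pop": m2 / n}
```

A usage example: summarize(merge(partial_state([2, 4, 4, 4]), partial_state([5, 5, 7, 9]))) yields the count, average and variance of all eight values together, without any step that sees the full data set at once.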

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
