hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: Standard deviation (STDDEV) function calculation in Hive
Date Wed, 01 Apr 2015 07:23:54 GMT
Hi Lefty,

ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java:
system.registerGenericUDAF("stddev", new GenericUDAFStd());
ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java:
system.registerGenericUDAF("stddev_pop", new GenericUDAFStd());
ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java:
system.registerGenericUDAF("stddev_samp", new GenericUDAFStdSample());

Looks like stddev() in hive is stddev_pop(), you can update the UDF docs to
match the FunctionRegistry aliases.

Cheers,
Gopal

From:  Lefty Leverenz <leftyleverenz@gmail.com>
Reply-To:  "user@hive.apache.org" <user@hive.apache.org>
Date:  Tuesday, March 31, 2015 at 11:13 PM
To:  "user@hive.apache.org" <user@hive.apache.org>
Subject:  Re: Standard deviation (STDDEV) function calculation in Hive

Mich, the linked documentation is for Impala, not Hive.  (Perhaps Hive is
the same, I don't know.)  But the Hive documentation doesn't explain much:
Built-in Aggregate Functions (UDAF)
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#Languag
eManualUDF-Built-inAggregateFunctions(UDAF)> .

-- Lefty

On Mon, Mar 30, 2015 at 5:26 PM, Mich Talebzadeh <mich@peridale.co.uk>
wrote:
> Hi,
>  
> Basically, the standard deviation (STDDEV) is a measure that is used to
> quantify the amount of variation or dispersion of a set of data values. It is
> widely used in trading systems such as FX.
>  
> STDEDV in Hive is explained here
> <http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-
> 0-x/topics/impala_stddev.html>  as below and I quote:
>  
> The STDDEV_POP() and STDDEV_SAMP() functions compute the population standard
> deviation and sample standard deviation, respectively, of the input values.
> (STDDEV() is an alias for STDDEV_SAMP().) Both functions evaluate all input
> rows matched by the query. The difference is that STDDEV_SAMP() is scaled by
> 1/(N-1) while STDDEV_POP() is scaled by 1/N.
>  
> Fair enough to me the common STDDEV is what Hive calls STDDEV_SAMP and
> according to above it ought to be STDDEV_SAMP.
>  
> However, when I work these out (and also use straight forward calculation
> myself), it turns out that the alias seems to be to STDDVE_POP as opposed to
> STDDEV_SAMP!.
>  
> The following calculation shows this
>  
> SELECT
>           rs.Customer_ID
>         , rs.Total_customer_amount
>        , rs.stddev
>         , rs.sdddev_samp
>         , rs.mystddev
> FROM
> (
>         SELECT cust_id AS Customer_ID,
>         COUNT(amount_sold) AS Number_of_orders,
>         SUM(amount_sold) AS Total_customer_amount,
>         AVG(amount_sold) AS Average_order,
>         stddev(amount_sold) AS stddev,
>         stddev_samp(amount_sold) AS sdddev_samp,
>         CASE
>         WHEN  COUNT (amount_sold) <= 1
>            THEN  0
>          ELSE  
> SQRT((SUM(POWER(amount_sold,2))-(COUNT(1)*POWER(AVG(amount_sold),2)))/(COUNT(1
> )-1))
>          END AS mystddev
>         FROM sales
>         GROUP BY cust_id
>         HAVING SUM(amount_sold) > 94000
>         AND AVG(amount_sold) < stddev(amount_sold)
> ) rs
> ORDER BY
>           rs.Total_customer_amount DESC
> ;
> +-----------------+---------------------------+--------------------+----------
> ----------+--------------------+--+
> | rs.customer_id  | rs.total_customer_amount  |     rs.stddev      |
> rs.sdddev_samp   |   rs.mystddev     |
> +-----------------+---------------------------+--------------------+----------
> ----------+--------------------+--+
> | 11407.0         | 103412.65999999995        | 622.221465710723   |
> 623.4797510518939  | 623.4797510518938  |
> | 10747.0         | 99578.08999999997         | 600.7615005975689  |
> 601.9383117167412  | 601.9383117167412  |
>  
> OK so looking above, we notice that rs.sdddev_sampandrs.mystddevare
> practically identical, whereas what is referred to as rs.stddev in Hive is not
> the one used in industry?
>  
> To show I ran the same in Oracle and the below is the result.
>  
> Connected to:
> Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
> With the Partitioning, OLAP, Data Mining and Real Application Testing options
>  
> CUSTOMER_ID TOTAL_CUSTOMER_AMOUNT                STDDEV              MYSTDDEV
> ----------- --------------------- --------------------- ---------------------
>       11407             103412.66     623.4797510518940     623.4797510518940
>       10747              99578.09     601.9383117167410     601.9383117167410
>  
> So sounds like for one reason or other what is called STDDEV in Hive and
> aliased to STDDEV_SAMP is incorrect?
>  
> Thanks,
>  
> Mich Talebzadeh
>  
> http://talebzadehmich.wordpress.com
>  
> Publications due shortly:
> Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and
> Coherence Cache
>  
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this message
> shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries
> or their employees, unless expressly so stated. It is the responsibility of
> the recipient to ensure that this email is virus free, therefore neither
> Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
>  




Mime
View raw message