hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Lefty Leverenz <leftylever...@gmail.com>
Subject Re: Standard deviation (STDDEV) function calculation in Hive
Date Wed, 01 Apr 2015 06:13:44 GMT
Mich, the linked documentation is for Impala, not Hive.  (Perhaps Hive is
the same, I don't know.)  But the Hive documentation doesn't explain
much:  Built-in
Aggregate Functions (UDAF)
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)>
.

-- Lefty

On Mon, Mar 30, 2015 at 5:26 PM, Mich Talebzadeh <mich@peridale.co.uk>
wrote:

> Hi,
>
>
>
> Basically, the standard deviation (STDDEV) is a measure that is used to
> quantify the amount of variation or dispersion of a set of data values. It
> is widely used in trading systems such as FX.
>
>
>
> STDEDV in Hive is explained here
> <http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_stddev.html>
> as below and I quote:
>
>
>
> The STDDEV_POP() and STDDEV_SAMP() functions compute the population
> standard deviation and sample standard deviation, respectively, of the
> input values. (*STDDEV()* *is an alias for STDDEV_SAMP().)* Both
> functions evaluate all input rows matched by the query. The difference is
> that STDDEV_SAMP() is scaled by 1/(N-1) while STDDEV_POP() is scaled by
> 1/N.
>
>
>
> Fair enough to me the common STDDEV is what Hive calls STDDEV_SAMP and
> according to above it ought to be STDDEV_SAMP.
>
>
>
> However, when I work these out (and also use straight forward calculation
> myself), it turns out that the alias seems to be to STDDVE_POP as opposed
> to STDDEV_SAMP!.
>
>
>
> The following calculation shows this
>
>
>
> SELECT
>
>           rs.Customer_ID
>
>         , rs.Total_customer_amount
>
>        , rs.stddev
>
>         , rs.sdddev_samp
>
>         , rs.mystddev
>
> FROM
>
> (
>
>         SELECT cust_id AS Customer_ID,
>
>         COUNT(amount_sold) AS Number_of_orders,
>
>         SUM(amount_sold) AS Total_customer_amount,
>
>         AVG(amount_sold) AS Average_order,
>
>         stddev(amount_sold) AS stddev,
>
>         stddev_samp(amount_sold) AS sdddev_samp,
>
>         CASE
>
>         WHEN  COUNT (amount_sold) <= 1
>
>            THEN  0
>
>          ELSE
> SQRT((SUM(POWER(amount_sold,2))-(COUNT(1)*POWER(AVG(amount_sold),2)))/(COUNT(1)-1))
>
>          END AS mystddev
>
>         FROM sales
>
>         GROUP BY cust_id
>
>         HAVING SUM(amount_sold) > 94000
>
>         AND AVG(amount_sold) < stddev(amount_sold)
>
> ) rs
>
> ORDER BY
>
>           rs.Total_customer_amount DESC
>
> ;
>
>
> +-----------------+---------------------------+--------------------+--------------------+--------------------+--+
>
> | rs.customer_id  | rs.total_customer_amount  |     rs.stddev      |
> rs.sdddev_samp   |    rs.mystddev     |
>
>
> +-----------------+---------------------------+--------------------+--------------------+--------------------+--+
>
> | 11407.0         | 103412.65999999995        | 622.221465710723   |
> 623.4797510518939  | 623.4797510518938  |
>
> | 10747.0         | 99578.08999999997         | 600.7615005975689  |
> 601.9383117167412  | 601.9383117167412  |
>
>
>
> OK so looking above, we notice that rs.sdddev_samp and rs.mystddev are
> practically identical, whereas what is referred to as rs.stddev in Hive
> is not the one used in industry?
>
>
>
> To show I ran the same in Oracle and the below is the result.
>
>
>
> Connected to:
>
> Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit
> Production
>
> With the Partitioning, OLAP, Data Mining and Real Application Testing
> options
>
>
>
> CUSTOMER_ID TOTAL_CUSTOMER_AMOUNT                STDDEV
> MYSTDDEV
>
> ----------- --------------------- ---------------------
> ---------------------
>
>       11407             103412.66     623.4797510518940
> 623.4797510518940
>
>       10747              99578.09     601.9383117167410
> 601.9383117167410
>
>
>
> So sounds like for one reason or other what is called STDDEV in Hive and
> aliased to STDDEV_SAMP is incorrect?
>
>
>
> Thanks,
>
>
>
> Mich Talebzadeh
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Publications due shortly:*
>
> *Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and
> Coherence Cache*
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Ltd, its
> subsidiaries or their employees, unless expressly so stated. It is the
> responsibility of the recipient to ensure that this email is virus free,
> therefore neither Peridale Ltd, its subsidiaries nor their employees accept
> any responsibility.
>
>
>

Mime
View raw message