hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mich Talebzadeh" <m...@peridale.co.uk>
Subject RE: Standard deviation (STDDEV) function calculation in Hive
Date Wed, 01 Apr 2015 08:20:22 GMT
Hi Lefty,

 

The Hive aggregate functions as you provide just states:

 


DOUBLE

stddev_pop(col)

Returns the standard deviation of a numeric column in the group.


DOUBLE

stddev_samp(col)

Returns the unbiased sample standard deviation of a numeric column in the group.

 

There is no mention of STDDEV here. So this is what I did with the previous query runni9ng
and displaying STDDEV, STDDEV_SAMP and STDDDEV_POP

 

0: jdbc:hive2://rhes564:10010/default> SELECT cust_id AS Customer_ID,

. . . . . . . . . . . . . . . . . . .> SUM(amount_sold) AS Total_customer_amount,

. . . . . . . . . . . . . . . . . . .> AVG(amount_sold) AS Average_order,

. . . . . . . . . . . . . . . . . . .> STDDEV(amount_sold) AS STDDEV,

. . . . . . . . . . . . . . . . . . .> STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,

. . . . . . . . . . . . . . . . . . .> STDDEV_POP(amount_sold) AS STDDEV_POP

. . . . . . . . . . . . . . . . . . .> FROM sales

. . . . . . . . . . . . . . . . . . .> GROUP BY cust_id

. . . . . . . . . . . . . . . . . . .> HAVING SUM(amount_sold) > 94000

. . . . . . . . . . . . . . . . . . .> AND AVG(amount_sold) < STDDEV_SAMP(amount_sold)

. . . . . . . . . . . . . . . . . . .> limit 3

. . . . . . . . . . . . . . . . . . .> ;

 

 

+--------------+------------------------+---------------------+--------------------+--------------------+--------------------+--+

| customer_id  | total_customer_amount  |    average_order    |       stddev       |    stddev_samp
    |     stddev_pop     |

+--------------+------------------------+---------------------+--------------------+--------------------+--------------------+--+

| 429.0        | 94819.41000000006      | 410.4736363636366   | 613.7057080691426  | 615.0384039014772
 | 613.7057080691426  |

| 1743.0       | 94786.12999999993      | 398.2610504201678   | 581.0439095219863  | 582.2684502048478
 | 581.0439095219863  |

| 2994.0       | 94862.61000000006      | 417.89696035242315  | 623.1607772763742  | 624.5379298449825
 | 623.1607772763742  |

+--------------+------------------------+---------------------+--------------------+--------------------+--------------------+--+

3 rows selected (205.266 seconds)

 

So according to above what Hive calls STDDEV is a pointer to STDDEV_POP.

 

But Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP as shown in the results below
from Oracle and Sybase respectively

 

CUSTOMER_ID TOTAL_CUSTOMER_AMOUNT AVERAGE_ORDER     STDDEV STDDEV_SAMP STDDEV_POP

----------- --------------------- ------------- ---------- ----------- ----------

          429              94819.41    410.473636 615.038404  615.038404 613.705708

         1743              94786.13     398.26105  582.26845   582.26845  581.04391

         2994              94862.61     417.89696  624.53793   624.53793 623.160777

 

 

 

Customer_ID                               Total_customer_amount                          
        Average_order                       STDDEV                      STDDEV_SAMP      
          STDDEV_POP

-------------------- ----------------------------------------- ----------------------------------------------------
--------------------------- --------------------------- ---------------------------

                  429                                  94819.41                          
         410.4736363636363                  615.038404                  615.038404       
          613.705708

                 1743                                  94786.13                          
         398.2610504201680                  582.268450                  582.268450       
          581.043910

                 2994                                  94862.61                          
         417.8969603524229                  624.537930                  624.537930       
          623.160777

 

So may be the point goes beyond Hive documentation. The value provided by STDDEV in Hive does
not appear to be industry standard

 

HTH

 

Mich Talebzadeh

 

http://talebzadehmich.wordpress.com

 

Publications due shortly:

Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache

 

NOTE: The information in this email is proprietary and confidential. This message is for the
designated recipient only, if you are not the intended recipient, you should destroy it immediately.
Any information in this message shall not be understood as given or endorsed by Peridale Ltd,
its subsidiaries or their employees, unless expressly so stated. It is the responsibility
of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd,
its subsidiaries nor their employees accept any responsibility.

 

From: Lefty Leverenz [mailto:leftyleverenz@gmail.com] 
Sent: 01 April 2015 07:14
To: user@hive.apache.org
Subject: Re: Standard deviation (STDDEV) function calculation in Hive

 

Mich, the linked documentation is for Impala, not Hive.  (Perhaps Hive is the same, I don't
know.)  But the Hive documentation doesn't explain much:  Built-in Aggregate Functions (UDAF)
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)>
.




-- Lefty

 

On Mon, Mar 30, 2015 at 5:26 PM, Mich Talebzadeh <mich@peridale.co.uk> wrote:

Hi,

 

Basically, the standard deviation (STDDEV) is a measure that is used to quantify the amount
of variation or dispersion of a set of data values. It is widely used in trading systems such
as FX.

 

STDEDV in Hive is explained here <http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_stddev.html>
 as below and I quote:

 

The STDDEV_POP() and STDDEV_SAMP() functions compute the population standard deviation and
sample standard deviation, respectively, of the input values. (STDDEV() is an alias for STDDEV_SAMP().)
Both functions evaluate all input rows matched by the query. The difference is that STDDEV_SAMP()
is scaled by 1/(N-1) while STDDEV_POP() is scaled by 1/N.

 

Fair enough to me the common STDDEV is what Hive calls STDDEV_SAMP and according to above
it ought to be STDDEV_SAMP.

 

However, when I work these out (and also use straight forward calculation myself), it turns
out that the alias seems to be to STDDVE_POP as opposed to STDDEV_SAMP!.

 

The following calculation shows this

 

SELECT

          rs.Customer_ID

        , rs.Total_customer_amount

       , rs.stddev

        , rs.sdddev_samp

        , rs.mystddev

FROM

(

        SELECT cust_id AS Customer_ID,

        COUNT(amount_sold) AS Number_of_orders,

        SUM(amount_sold) AS Total_customer_amount,

        AVG(amount_sold) AS Average_order,

        stddev(amount_sold) AS stddev,

        stddev_samp(amount_sold) AS sdddev_samp,

        CASE

        WHEN  COUNT (amount_sold) <= 1

           THEN  0

         ELSE  SQRT((SUM(POWER(amount_sold,2))-(COUNT(1)*POWER(AVG(amount_sold),2)))/(COUNT(1)-1))

         END AS mystddev

        FROM sales

        GROUP BY cust_id

        HAVING SUM(amount_sold) > 94000

        AND AVG(amount_sold) < stddev(amount_sold)

) rs

ORDER BY

          rs.Total_customer_amount DESC

;

+-----------------+---------------------------+--------------------+--------------------+--------------------+--+

| rs.customer_id  | rs.total_customer_amount  |     rs.stddev      |   rs.sdddev_samp   |
   rs.mystddev     |

+-----------------+---------------------------+--------------------+--------------------+--------------------+--+

| 11407.0         | 103412.65999999995        | 622.221465710723   | 623.4797510518939  |
623.4797510518938  |

| 10747.0         | 99578.08999999997         | 600.7615005975689  | 601.9383117167412  |
601.9383117167412  |

 

OK so looking above, we notice that rs.sdddev_samp and rs.mystddev are practically identical,
whereas what is referred to as rs.stddev in Hive is not the one used in industry?  

 

To show I ran the same in Oracle and the below is the result.

 

Connected to:

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production

With the Partitioning, OLAP, Data Mining and Real Application Testing options

 

CUSTOMER_ID TOTAL_CUSTOMER_AMOUNT                STDDEV              MYSTDDEV

----------- --------------------- --------------------- ---------------------

      11407             103412.66     623.4797510518940     623.4797510518940

      10747              99578.09     601.9383117167410     601.9383117167410

 

So sounds like for one reason or other what is called STDDEV in Hive and aliased to STDDEV_SAMP
is incorrect?

 

Thanks,

 

Mich Talebzadeh

 

http://talebzadehmich.wordpress.com

 

Publications due shortly:

Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache

 

NOTE: The information in this email is proprietary and confidential. This message is for the
designated recipient only, if you are not the intended recipient, you should destroy it immediately.
Any information in this message shall not be understood as given or endorsed by Peridale Ltd,
its subsidiaries or their employees, unless expressly so stated. It is the responsibility
of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd,
its subsidiaries nor their employees accept any responsibility.

 

 


Mime
View raw message