hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mich Talebzadeh" <m...@peridale.co.uk>
Subject My query on Standard deviation function STDDEV() in Hive
Date Fri, 27 Mar 2015 21:49:01 GMT
Hi gurus,

 

STDDEV function can be used for both aggregates and analytics. Fortunately
in Hive it has been implemented which is great. I have a simple question on
this if I may

 

I would expect the in-built function STDDEV to run pretty efficiently in
most databases as they are system defined functions. At least that is my
understanding. I did a test on Hive with both STDDEV in-built function and
also the equivalent of it that I wrote myself

 

In this example the base table sales 918,843 rows imported from Oracle
sh.sales table. and I am trying to work STDDEV for as column called
amount_sold

 

SELECT CASE

        WHEN  COUNT (amount_sold) <= 1

           THEN  0

         ELSE
SQRT((SUM(POWER(amount_sold,2))-(COUNT(1)*POWER(AVG(amount_sold),2)))/(COUNT
(1)-1))

         END

FROM sales;

 

I ran this first with straight forward calculation for standard deviation

 

1)     It returned 259.7804899502628               in 19.656 seconds

 

Then I ran it immediately after using STDDV function itself in hive

 

hive> SELECT STDDEV(amount_sold) from sales;

 

2)     It retuned 259.7803485874695                in 20.346 seconds

 

Now the value itself is pretty close but seems to be taking slightly more
time in 2)

 

When I did the same tests on base table in Oracle I got first for my
calculation

 

1  SELECT CASE

  2          WHEN  COUNT (amount_sold) <= 1

  3             THEN  0

  4           ELSE
SQRT((SUM(POWER(amount_sold,2))-(COUNT(1)*POWER(AVG(amount_sold),2)))/(COUNT
(1)-1))

  5           END

  6* FROM sh.sales

 

hddtester@MYDB.MICH.LOCAL> /

 

CASEWHENCOUNT(AMOUNT_SOLD)<=1THEN0ELSESQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT
(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))END

----------------------------------------------------------------------------
------------------------------------------------

259.78049

 

Elapsed: 00:00:00.23

 

It took 230 ms. Then I ran again using in built function in Oracle

 

 

hddtester@MYDB.MICH.LOCAL> SELECT STDDEV(amount_sold) FROM sh.sales;

 

STDDEV(AMOUNT_SOLD)

-------------------

          259.78049

 

Elapsed: 00:00:00.16

 

Right I would expect the in-built function to be more efficient and quicker.
In Oracle it took 160 ms compared to 230 ms from my own calculation.

 

My query is why in Hive it takes longer to do the same calculation with
in-built STDDEV function? Is this behaviour expected.

 

 

Thanks

 

Mich

 

http://talebzadehmich.wordpress.com

 

Publications due shortly:

Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and
Coherence Cache

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Ltd, its
subsidiaries or their employees, unless expressly so stated. It is the
responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.

 


Mime
View raw message