hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mich Talebzadeh" <m...@peridale.co.uk>
Subject Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Date Tue, 29 Dec 2015 10:53:41 GMT
Hi,

 

I have a fact table in Hive imported from Sybase IQ via SQOOP with 1 billion
rows as follows:

 

show create table sales;

+---------------------------------------------------------------------------
----+--+

|                                createtab_stmt
|

+---------------------------------------------------------------------------
----+--+

| CREATE TABLE `sales`(
|

|   `prod_id` bigint,
|

|   `cust_id` bigint,
|

|   `time_id` timestamp,
|

|   `channel_id` bigint,
|

|   `promo_id` bigint,
|

|   `quantity_sold` decimal(10,0),
|

|   `amount_sold` decimal(10,0))
|

| CLUSTERED BY (
|

|   prod_id,
|

|   cust_id,
|

|   time_id,
|

|   channel_id,
|

|   promo_id)
|

| INTO 256 BUCKETS
|

| ROW FORMAT SERDE
|

|   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
|

| STORED AS INPUTFORMAT
|

|   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
|

| OUTPUTFORMAT
|

|   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
|

| LOCATION
|

|   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales'
|

| TBLPROPERTIES (
|

|   'COLUMN_STATS_ACCURATE'='true',
|

|   'last_modified_by'='hduser',
|

|   'last_modified_time'='1451305626',
|

|   'numFiles'='11',
|

|   'numRows'='1000000000',
|

|
'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID',
|

|   'orc.bloom.filter.fpp'='0.05',
|

|   'orc.compress'='SNAPPY',
|

|   'orc.create.index'='true',
|

|   'orc.row.index.stride'='10000',
|

|   'orc.stripe.size'='268435456',
|

|   'rawDataSize'='296000000000',
|

|   'totalSize'='2678882153',
|

|   'transient_lastDdlTime'='1451305626')
|

+---------------------------------------------------------------------------
----+--+

 

I use the following query to run against sales table only against Hive

 

SELECT

          rs.Customer_ID

        , rs.Number_of_orders

        , rs.Total_customer_amount

        , rs.Average_order

        , rs.Standard_deviation

FROM

(

        SELECT cust_id AS Customer_ID,

        COUNT(amount_sold) AS Number_of_orders,

        SUM(amount_sold) AS Total_customer_amount,

        AVG(amount_sold) AS Average_order,

        stddev_samp(amount_sold) AS Standard_deviation

        FROM sales

        GROUP BY cust_id

        HAVING SUM(amount_sold) > 94000

        AND AVG(amount_sold) < stddev_samp(amount_sold)

) rs

ORDER BY

          -- Total_customer_amount DESC

          3 DESC

 

Hive comes back in 17 minutes with 5,948 rows

 

bl -f sales.hql > sales.log

Connecting to jdbc:hive2://rhes564:10010/default

Connected to: Apache Hive (version 1.2.1)

Driver: Hive JDBC (version 1.2.1)

Transaction isolation: TRANSACTION_REPEATABLE_READ

Running init script /home/hduser/dba/bin/hive_on_spark_init.hql

No rows affected (0.097 seconds)

No rows affected (0.001 seconds)

No rows affected (0.001 seconds)

No rows affected (0.038 seconds)

INFO  : Warning: Using constant number 3 in order by. If you try to use
position alias when hive.groupby.orderby.position.alias is false, the
position alias will be ignored.

INFO  :

Query Hive on Spark job[0] stages:

INFO  : 0

INFO  : 1

INFO  : 2

INFO  :

Status: Running (Hive on Spark job[0])

INFO  : Job Progress Format

CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]

INFO  : 2015-12-29 09:33:25,815 Stage-0_0: 0/11 Stage-1_0: 0/1009
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:28,829 Stage-0_0: 0/11 Stage-1_0: 0/1009
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009
Stage-2_0: 0/1

 

INFO  : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11    Stage-1_0: 0/1009
Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11    Stage-1_0: 0/1009
Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11    Stage-1_0: 0/1009
Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished       Stage-1_0:
0(+2)/1009   Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished       Stage-1_0:
6(+2)/1009   Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished       Stage-1_0:
30(+2)/1009  Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished       Stage-1_0:
59(+2)/1009  Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished       Stage-1_0:
90(+2)/1009  Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished       Stage-1_0:
124(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished       Stage-1_0:
160(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished       Stage-1_0:
193(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished       Stage-1_0:
228(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished       Stage-1_0:
262(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished       Stage-1_0:
298(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished       Stage-1_0:
338(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished       Stage-1_0:
376(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished       Stage-1_0:
417(+3)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished       Stage-1_0:
460(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished       Stage-1_0:
502(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished       Stage-1_0:
542(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished       Stage-1_0:
584(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished       Stage-1_0:
624(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished       Stage-1_0:
667(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished       Stage-1_0:
709(+3)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished       Stage-1_0:
754(+3)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished       Stage-1_0:
797(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished       Stage-1_0:
844(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished       Stage-1_0:
888(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished       Stage-1_0:
934(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished       Stage-1_0:
981(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished       Stage-1_0:
1009/1009 Finished   Stage-2_0: 0(+1)/1

INFO  : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished       Stage-1_0:
1009/1009 Finished   Stage-2_0: 1/1 Finished

INFO  : Status: Finished successfully in 1036.00 seconds

5,948 rows selected (1074.817 seconds)

 

So it returns 5948 rows in 17 minutes. In contrast IQ returns 5947 rows in
23 seconds

 

Sybase IQ is a columnar database so each column is created as a fast
projection index by default. In addition I have created LF (bitmap) indexes
on dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID). Now
the query only touches CUST_ID.

 

My suspicion is that it is the Standard Deviation function stddev_samp()
that could be the bottleneck?

 

Thanks

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 
<http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908
.pdf>
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.
pdf

Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

 

 <http://talebzadehmich.wordpress.com/> http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.

 


Mime
View raw message