hive-user mailing list archives

From: Xuefu Zhang <xzh...@cloudera.com>
Subject: Re: Running Spark-sql on Hive metastore
Date: Mon, 01 Feb 2016 03:05:09 GMT
For Hive on Spark, there is a startup cost, so the second run should be
faster. More importantly, it looks like you have 18 map tasks but your
cluster runs only two of them at a time, so you effectively get two-way
parallelism. If you configure your cluster to give more capacity to Hive,
the speed should improve as well. Note that each of your map tasks takes
only seconds to complete.
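A minimal sketch of session-level settings that could raise that capacity
(these are standard Spark executor properties that Hive on Spark hands to
its Spark session; the values below are placeholders to size against your
cluster, not recommendations):

set spark.executor.instances=8;  -- placeholder: number of executors to request
set spark.executor.cores=4;      -- placeholder: concurrent tasks per executor
set spark.executor.memory=4g;    -- placeholder: heap per executor

With, say, 8 executors of 4 cores each, all 18 map tasks could run
concurrently instead of two at a time.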

On Sun, Jan 31, 2016 at 3:07 PM, Mich Talebzadeh <mich@peridale.co.uk>
wrote:

> Hi,
>
>
>
> - Spark 1.5.2 on Hive 1.2.1
>
> - Hive 1.2.1 on Spark 1.3.1
>
> - Oracle Release 11.2.0.1.0
>
> - Hadoop 2.6
>
>
>
> I am running spark-sql against the Hive metastore and I am pleasantly
> surprised by the speed at which Spark performs certain queries on Hive
> tables.
>
>
>
> I imported a 100 million row table from Oracle into a Hive staging table
> via Sqoop and then did an insert/select into an ORC table in Hive, as
> defined below (a sketch of the load statement follows the DDL).
>
>
>
> +------------------------------------------------------------+--+
> |                       createtab_stmt                       |
> +------------------------------------------------------------+--+
> | CREATE TABLE `dummy`(                                      |
> |   `id` int,                                                |
> |   `clustered` int,                                         |
> |   `scattered` int,                                         |
> |   `randomised` int,                                        |
> |   `random_string` varchar(50),                             |
> |   `small_vc` varchar(10),                                  |
> |   `padding` varchar(10))                                   |
> | CLUSTERED BY (                                             |
> |   id)                                                      |
> | INTO 256 BUCKETS                                           |
> | ROW FORMAT SERDE                                           |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'              |
> | STORED AS INPUTFORMAT                                      |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'        |
> | OUTPUTFORMAT                                               |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'       |
> | LOCATION                                                   |
> |   'hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy'  |
> | TBLPROPERTIES (                                            |
> |   'COLUMN_STATS_ACCURATE'='true',                          |
> |   'numFiles'='35',                                         |
> |   'numRows'='100000000',                                   |
> |   'orc.bloom.filter.columns'='ID',                         |
> |   'orc.bloom.filter.fpp'='0.05',                           |
> |   'orc.compress'='SNAPPY',                                 |
> |   'orc.create.index'='true',                               |
> |   'orc.row.index.stride'='10000',                          |
> |   'orc.stripe.size'='16777216',                            |
> |   'rawDataSize'='33800000000',                             |
> |   'totalSize'='5660813776',                                |
> |   'transient_lastDdlTime'='1454234981')                    |
> +------------------------------------------------------------+--+
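>
> For reference, a minimal sketch of that load step (the staging table name
> dummy_staging is an assumption for illustration; it is not named above):
>
> -- Hypothetical: dummy_staging stands in for the Sqoop-imported staging table.
> INSERT INTO TABLE dummy
> SELECT id, clustered, scattered, randomised, random_string, small_vc, padding
> FROM dummy_staging;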
>
>
>
> I am doing simple min and max functions on the columns scattered and
> randomised from the above table; neither column is part of the clustering
> in Hive, and in Oracle there is no index on them either.
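>
> To illustrate the difference: only a predicate on the bucketed column id
> could exploit the bloom filter declared in the DDL above
> ('orc.bloom.filter.columns'='ID'); the min/max aggregates have to read
> every row. A sketch (the literal 12345 is an arbitrary example value):
>
> -- Point lookup: can skip ORC stripes/row groups via the bloom filter on ID.
> select count(*) from dummy where id = 12345;
> -- No index help in Hive or Oracle: full scan of all 100 million rows.
> select min(scattered), max(randomised) from dummy;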
>
>
>
> *If I use Hive 1.2.1 on Spark 1.3.1 it comes back in 50.751 seconds*
>
>
>
> *select min(scattered), max(randomised) from dummy;*
>
> INFO  :
> Query Hive on Spark job[0] stages:
> INFO  : 0
> INFO  : 1
> INFO  :
> Status: Running (Hive on Spark job[0])
> INFO  : Job Progress Format
> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
> INFO  : 2016-01-31 22:55:05,114 Stage-0_0: 0/18 Stage-1_0: 0/1
> INFO  : 2016-01-31 22:55:06,122 Stage-0_0: 0(+2)/18     Stage-1_0: 0/1
> INFO  : 2016-01-31 22:55:09,165 Stage-0_0: 0(+2)/18     Stage-1_0: 0/1
> INFO  : 2016-01-31 22:55:12,190 Stage-0_0: 2(+2)/18     Stage-1_0: 0/1
> INFO  : 2016-01-31 22:55:14,201 Stage-0_0: 3(+2)/18     Stage-1_0: 0/1
> INFO  : 2016-01-31 22:55:15,209 Stage-0_0: 4(+2)/18     Stage-1_0: 0/1
> INFO  : 2016-01-31 22:55:17,218 Stage-0_0: 6(+2)/18     Stage-1_0: 0/1
> INFO  : 2016-01-31 22:55:20,234 Stage-0_0: 8(+2)/18     Stage-1_0: 0/1
> INFO  : 2016-01-31 22:55:22,245 Stage-0_0: 10(+2)/18    Stage-1_0: 0/1
> INFO  : 2016-01-31 22:55:25,257 Stage-0_0: 12(+2)/18    Stage-1_0: 0/1
> INFO  : 2016-01-31 22:55:27,270 Stage-0_0: 14(+2)/18    Stage-1_0: 0/1
> INFO  : 2016-01-31 22:55:30,289 Stage-0_0: 16(+2)/18    Stage-1_0: 0/1
> INFO  : 2016-01-31 22:55:31,294 Stage-0_0: 17(+1)/18    Stage-1_0: 0/1
> INFO  : 2016-01-31 22:55:32,302 Stage-0_0: 18/18 Finished       Stage-1_0: 0(+1)/1
> INFO  : 2016-01-31 22:55:33,309 Stage-0_0: 18/18 Finished       Stage-1_0: 1/1 Finished
> INFO  : Status: Finished successfully in 46.37 seconds
> +------+------+--+
> | _c0  | _c1  |
> +------+------+--+
> | 0    | 999  |
> +------+------+--+
>
> *1 row selected (50.751 seconds)*
>
>
>
> *If I use Spark 1.5.2 on Hive 1.2.1 it comes back in 7.37 seconds (three
> runs)*
>
>
>
> *select min(scattered), max(randomised) from dummy; *
>
> 16/01/31 22:59:30 INFO parse.ParseDriver: Parsing command: select min(scattered), max(randomised) from dummy
> 16/01/31 22:59:30 INFO parse.ParseDriver: Parse Completed
> 16/01/31 22:59:30 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
> 16/01/31 22:59:30 INFO storage.MemoryStore: ensureFreeSpace(480952) called with curMem=4732, maxMem=555684986
> 16/01/31 22:59:30 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 469.7 KB, free 529.5 MB)
> 16/01/31 22:59:31 INFO storage.MemoryStore: ensureFreeSpace(41724) called with curMem=485684, maxMem=555684986
> 16/01/31 22:59:31 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 40.7 KB, free 529.4 MB)
> 16/01/31 22:59:31 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 50.140.197.217:50516 (size: 40.7 KB, free: 529.9 MB)
> 16/01/31 22:59:31 INFO spark.SparkContext: Created broadcast 1 from processCmd at CliDriver.java:376
> 16/01/31 22:59:31 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376
> 16/01/31 22:59:31 INFO log.PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
> 16/01/31 22:59:31 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
> 16/01/31 22:59:31 INFO orc.OrcInputFormat: FooterCacheHitRatio: 0/0
> 16/01/31 22:59:31 INFO log.PerfLogger: </PERFLOG method=OrcGetSplits start=1454281171262 end=1454281171330 duration=68 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
> 16/01/31 22:59:31 INFO scheduler.DAGScheduler: Registering RDD 6 (processCmd at CliDriver.java:376)
> 16/01/31 22:59:38 INFO scheduler.StatsReportListener:   0%      5%      10%     25%     50%     75%     90%     95%     100%
> 16/01/31 22:59:38 INFO scheduler.StatsReportListener:   0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms
>
> 0       999
> *Time taken: 7.37 seconds, Fetched 1 row(s)*
>
>
>
> *It seems that for a full table scan of a 100 million row table, Spark is
> on par with Oracle 11g, which returns the same result in 7.03 seconds
> (three runs)*, doing a full table scan as expected:
>
>
>
> scratchpad@MYDB.MICH.LOCAL> *select min(scattered), max(randomised) from dummy;*
>
> MIN(SCATTERED) MAX(RANDOMISED)
> -------------- ---------------
>              0             999
>
> *Elapsed: 00:00:07.03*
>
>
>
> Execution Plan
> ----------------------------------------------------------
> Plan hash value: 2937163428
>
> ----------------------------------------------------------------------------
> | Id  | Operation          | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
> ----------------------------------------------------------------------------
> |   0 | SELECT STATEMENT   |       |     1 |     8 |   260K  (1)| 00:52:12 |
> |   1 |  SORT AGGREGATE    |       |     1 |     8 |            |          |
> |   2 |   TABLE ACCESS FULL| DUMMY |   100M|   762M|   260K  (1)| 00:52:12 |
> ----------------------------------------------------------------------------
>
>
>
>
>
> Statistics
> ----------------------------------------------------------
>           0  recursive calls
>           0  db block gets
>     1347179  consistent gets
>     1347168  physical reads
>           0  redo size
>         612  bytes sent via SQL*Net to client
>         523  bytes received via SQL*Net from client
>           2  SQL*Net roundtrips to/from client
>           0  sorts (memory)
>           0  sorts (disk)
>           1  rows processed
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> *Sybase ASE 15 Gold Medal Award 2008*
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the book *"A Practitioner’s Guide to Upgrading to Sybase ASE 15"*, ISBN 978-0-9563693-0-7.
> Co-author of *"Sybase Transact SQL Guidelines Best Practices"*, ISBN 978-0-9759693-0-4.
>
> *Publications due shortly:*
> *Complex Event Processing in Heterogeneous Environments*, ISBN 978-0-9563693-3-8
> *Oracle and Sybase, Concepts and Contrasts*, ISBN 978-0-9563693-1-4, volume one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
