hive-user mailing list archives

From Stephen Sprague <sprag...@gmail.com>
Subject Re: Hive on Spark Engine versus Spark using Hive metastore
Date Thu, 04 Feb 2016 04:19:12 GMT
i refuse to take anybody seriously who has a sig file longer than one line
and that there is just plain repugnant.

On Wed, Feb 3, 2016 at 1:47 PM, Mich Talebzadeh <mich@peridale.co.uk> wrote:

> I just did some further tests joining a 5-million-row FACT table with 2
> DIMENSION tables.
>
>
>
> SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS
> TotalSales
>
> FROM sales s, times t, channels c
>
> WHERE s.time_id = t.time_id
>
> AND   s.channel_id = c.channel_id
>
> GROUP BY t.calendar_month_desc, c.channel_desc
>
> ;
>
>
>
>
>
> Hive on Spark crashes, Hive with MR finishes in 85 sec and Spark on Hive
> finishes in 267 sec. I am trying to understand this behaviour
>
>
>
> OK, I changed the three parameters below as suggested by Jeff
>
>
>
> export SPARK_EXECUTOR_CORES=12 ## Number of cores for the workers
> (Default: 1)
>
> export SPARK_EXECUTOR_MEMORY=5G ## Memory per worker (e.g. 1000M, 2G)
> (Default: 1G)
>
> export SPARK_DRIVER_MEMORY=2G ## Memory for the driver (e.g. 1000M, 2G)
> (Default: 512M)
>
>
>
>
>
> *1) Hive 1.2.1 on Spark 1.3.1*
>
> It fails. Never completes.
>
>
>
> ERROR : Status: Failed
>
> Error: Error while processing statement: FAILED: Execution Error, return
> code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
> (state=08S01,code=3)
>
>
>
> *2) Hive 1.2.1 on MR engine: looks good and completes in 85 sec*
>
>
>
> 0: jdbc:hive2://rhes564:10010/default> SELECT t.calendar_month_desc,
> c.channel_desc, SUM(s.amount_sold) AS TotalSales
>
> 0: jdbc:hive2://rhes564:10010/default> FROM sales s, times t, channels c
>
> 0: jdbc:hive2://rhes564:10010/default> WHERE s.time_id = t.time_id
>
> 0: jdbc:hive2://rhes564:10010/default> AND   s.channel_id = c.channel_id
>
> 0: jdbc:hive2://rhes564:10010/default> GROUP BY t.calendar_month_desc,
> c.channel_desc
>
> 0: jdbc:hive2://rhes564:10010/default> ;
>
> INFO  : Execution completed successfully
>
> INFO  : MapredLocal task succeeded
>
> INFO  : Number of reduce tasks not specified. Estimated from input data
> size: 1
>
> INFO  : In order to change the average load for a reducer (in bytes):
>
> INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
>
> INFO  : In order to limit the maximum number of reducers:
>
> INFO  :   set hive.exec.reducers.max=<number>
>
> INFO  : In order to set a constant number of reducers:
>
> INFO  :   set mapreduce.job.reduces=<number>
>
> WARN  : Hadoop command-line option parsing not performed. Implement the
> Tool interface and execute your application with ToolRunner to remedy this.
>
> INFO  : number of splits:1
>
> INFO  : Submitting tokens for job: job_1454534517374_0002
>
> INFO  : The url to track the job:
> http://rhes564:8088/proxy/application_1454534517374_0002/
>
> INFO  : Starting Job = job_1454534517374_0002, Tracking URL =
> http://rhes564:8088/proxy/application_1454534517374_0002/
>
> INFO  : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
> job_1454534517374_0002
>
> INFO  : Hadoop job information for Stage-3: number of mappers: 1; number
> of reducers: 1
>
> INFO  : 2016-02-03 21:25:17,769 Stage-3 map = 0%,  reduce = 0%
>
> INFO  : 2016-02-03 21:25:29,103 Stage-3 map = 2%,  reduce = 0%, Cumulative
> CPU 7.52 sec
>
> INFO  : 2016-02-03 21:25:32,205 Stage-3 map = 5%,  reduce = 0%, Cumulative
> CPU 10.19 sec
>
> INFO  : 2016-02-03 21:25:35,295 Stage-3 map = 7%,  reduce = 0%, Cumulative
> CPU 12.69 sec
>
> INFO  : 2016-02-03 21:25:38,392 Stage-3 map = 10%,  reduce = 0%,
> Cumulative CPU 15.2 sec
>
> INFO  : 2016-02-03 21:25:41,502 Stage-3 map = 13%,  reduce = 0%,
> Cumulative CPU 17.31 sec
>
> INFO  : 2016-02-03 21:25:44,600 Stage-3 map = 16%,  reduce = 0%,
> Cumulative CPU 21.55 sec
>
> INFO  : 2016-02-03 21:25:47,691 Stage-3 map = 20%,  reduce = 0%,
> Cumulative CPU 24.32 sec
>
> INFO  : 2016-02-03 21:25:50,786 Stage-3 map = 23%,  reduce = 0%,
> Cumulative CPU 26.3 sec
>
> INFO  : 2016-02-03 21:25:52,858 Stage-3 map = 27%,  reduce = 0%,
> Cumulative CPU 28.52 sec
>
> INFO  : 2016-02-03 21:25:55,948 Stage-3 map = 31%,  reduce = 0%,
> Cumulative CPU 30.65 sec
>
> INFO  : 2016-02-03 21:25:59,032 Stage-3 map = 35%,  reduce = 0%,
> Cumulative CPU 32.7 sec
>
> INFO  : 2016-02-03 21:26:02,120 Stage-3 map = 40%,  reduce = 0%,
> Cumulative CPU 34.69 sec
>
> INFO  : 2016-02-03 21:26:05,217 Stage-3 map = 43%,  reduce = 0%,
> Cumulative CPU 36.67 sec
>
> INFO  : 2016-02-03 21:26:08,310 Stage-3 map = 47%,  reduce = 0%,
> Cumulative CPU 38.78 sec
>
> INFO  : 2016-02-03 21:26:11,408 Stage-3 map = 52%,  reduce = 0%,
> Cumulative CPU 40.7 sec
>
> INFO  : 2016-02-03 21:26:14,512 Stage-3 map = 56%,  reduce = 0%,
> Cumulative CPU 42.69 sec
>
> INFO  : 2016-02-03 21:26:17,607 Stage-3 map = 60%,  reduce = 0%,
> Cumulative CPU 44.69 sec
>
> INFO  : 2016-02-03 21:26:20,722 Stage-3 map = 64%,  reduce = 0%,
> Cumulative CPU 46.83 sec
>
> INFO  : 2016-02-03 21:26:22,787 Stage-3 map = 100%,  reduce = 0%,
> Cumulative CPU 48.46 sec
>
> INFO  : 2016-02-03 21:26:29,030 Stage-3 map = 100%,  reduce = 100%,
> Cumulative CPU 50.01 sec
>
> INFO  : MapReduce Total cumulative CPU time: 50 seconds 10 msec
>
> INFO  : Ended Job = job_1454534517374_0002
>
> +------------------------+-----------------+-------------+--+
>
> | t.calendar_month_desc  | c.channel_desc  | totalsales  |
>
> +------------------------+-----------------+-------------+--+
>
> +------------------------+-----------------+-------------+--+
>
> 150 rows selected (85.67 seconds)
>
>
>
> *3) Spark on Hive engine completes in 267 sec*
>
> spark-sql> SELECT t.calendar_month_desc, c.channel_desc,
> SUM(s.amount_sold) AS TotalSales
>
>          > FROM sales s, times t, channels c
>
>          > WHERE s.time_id = t.time_id
>
>          > AND   s.channel_id = c.channel_id
>
>          > GROUP BY t.calendar_month_desc, c.channel_desc
>
>          > ;
>
> Time taken: 267.138 seconds, Fetched 150 row(s)
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
> *From:* Mich Talebzadeh [mailto:mich@peridale.co.uk]
> *Sent:* 03 February 2016 16:21
> *To:* user@hive.apache.org
> *Subject:* RE: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> OK thanks. These are my new ENV settings based upon the availability of
> resources
>
>
>
> export SPARK_EXECUTOR_CORES=12 ## Number of cores for the workers
> (Default: 1)
>
> export SPARK_EXECUTOR_MEMORY=5G ## Memory per worker (e.g. 1000M, 2G)
> (Default: 1G)
>
> export SPARK_DRIVER_MEMORY=2G ## Memory for the driver (e.g. 1000M, 2G)
> (Default: 512M)
>
>
>
> These are the new runs after these settings:
>
>
>
> *Spark on Hive (3 consecutive runs)*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 100000);
>
> 1       0       0       63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1
> xxxxxxxxxx
>
> 5       0       4       31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5
> xxxxxxxxxx
>
> 100000  99      999     188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000
> xxxxxxxxxx
>
> Time taken: 47.987 seconds, Fetched 3 row(s)
>
>
>
> Around 48 seconds
>
>
>
> *Hive on Spark 1.3.1*
>
>
>
> 0: jdbc:hive2://rhes564:10010/default>  select * from dummy where id in
> (1, 5, 100000);
>
> INFO  :
>
> Query Hive on Spark job[2] stages:
>
> INFO  : 2
>
> INFO  :
>
> Status: Running (Hive on Spark job[2])
>
> INFO  : Job Progress Format
>
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
>
> INFO  : 2016-02-03 16:20:50,315 Stage-2_0: 0(+18)/18
>
> INFO  : 2016-02-03 16:20:53,369 Stage-2_0: 0(+18)/18
>
> INFO  : 2016-02-03 16:20:56,478 Stage-2_0: 0(+18)/18
>
> INFO  : 2016-02-03 16:20:58,530 Stage-2_0: 1(+17)/18
>
> INFO  : 2016-02-03 16:21:01,570 Stage-2_0: 1(+17)/18
>
> INFO  : 2016-02-03 16:21:04,680 Stage-2_0: 1(+17)/18
>
> INFO  : 2016-02-03 16:21:07,767 Stage-2_0: 1(+17)/18
>
> INFO  : 2016-02-03 16:21:10,877 Stage-2_0: 1(+17)/18
>
> INFO  : 2016-02-03 16:21:13,941 Stage-2_0: 1(+17)/18
>
> INFO  : 2016-02-03 16:21:17,019 Stage-2_0: 1(+17)/18
>
> INFO  : 2016-02-03 16:21:20,090 Stage-2_0: 3(+15)/18
>
> INFO  : 2016-02-03 16:21:21,138 Stage-2_0: 6(+12)/18
>
> INFO  : 2016-02-03 16:21:22,145 Stage-2_0: 10(+8)/18
>
> INFO  : 2016-02-03 16:21:23,150 Stage-2_0: 14(+4)/18
>
> INFO  : 2016-02-03 16:21:24,154 Stage-2_0: 17(+1)/18
>
> INFO  : 2016-02-03 16:21:26,161 Stage-2_0: 18/18 Finished
>
> INFO  : Status: Finished successfully in 36.88 seconds
>
>
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
> |                 dummy.random_string                 | dummy.small_vc  |
> dummy.padding  |
>
>
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>
> | 1         | 0                | 0                | 63                |
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      |
> xxxxxxxxxx     |
>
> | 5         | 0                | 4                | 31                |
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      |
> xxxxxxxxxx     |
>
> | 100000    | 99               | 999              | 188               |
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      |
> xxxxxxxxxx     |
>
>
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>
> 3 rows selected (37.161 seconds)
>
>
>
> Around 37 seconds
>
>
>
> Interesting results
>
>
>
>
>
>
>
>
> *From:* Xuefu Zhang [mailto:xzhang@cloudera.com]
> *Sent:* 03 February 2016 12:47
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> In YARN or standalone mode, you can set spark.executor.cores to utilize
> all cores on the node. You can also set spark.executor.memory to allocate
> memory for Spark to use. Once you do this, you may only have two executors
> to run your map tasks, but each core in each executor can take up one task,
> increasing parallelism. With this, the eventual limit may come down to
> the bandwidth of the disks in your cluster.
>
> Having said that, a two-node cluster isn't really big enough for a
> performance benchmark. Nevertheless, you still need to configure it
> properly to make full use of the cluster.
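[Editor's note: to make the advice above concrete, a minimal sketch of passing those properties on the command line. The property names are standard Spark 1.x; the master mode, values, and the script name `my_query.py` are illustrative placeholders only.]

```shell
# Hedged sketch: request 12 cores and 5 GB per executor, 2 GB for the driver.
# "my_query.py" is a placeholder for whatever job is being submitted.
spark-submit \
  --master yarn-client \
  --conf spark.executor.cores=12 \
  --conf spark.executor.memory=5g \
  --conf spark.driver.memory=2g \
  my_query.py
```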
>
> --Xuefu
>
>
>
> On Wed, Feb 3, 2016 at 1:25 AM, Mich Talebzadeh <mich@peridale.co.uk>
> wrote:
>
> Hi Jeff,
>
>
>
> I only have a two-node cluster. Is there any way one can simulate
> additional parallel runs in such an environment, thus having more than two
> maps?
>
>
>
> thanks
>
>
>
>
>
>
> *From:* Xuefu Zhang [mailto:xzhang@cloudera.com]
> *Sent:* 03 February 2016 02:39
>
>
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> Yes, regardless of which Spark mode you're running in, from the Spark AM
> web UI you should be able to see how many tasks are concurrently running.
> I'm a little surprised to see that your Hive configuration only allows 2
> map tasks to run in parallel. If your cluster has the capacity, you should
> parallelize all the tasks to achieve optimal performance. Since I don't
> know your Spark SQL configuration, I cannot tell how much parallelism you
> have over there. Thus, I'm not sure if your comparison is valid.
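[Editor's note: for Hive on Spark specifically, the equivalent knobs can be set per session. A hedged sketch follows; it assumes Hive on Spark honours spark.* properties set at the session level, and the JDBC URL and values are taken from or modelled on this thread.]

```shell
# Hedged sketch: raise Hive-on-Spark parallelism for a single beeline session
# before running the benchmark join. Values are illustrative for two nodes.
beeline -u jdbc:hive2://rhes564:10010/default -e "
set spark.executor.instances=2;
set spark.executor.cores=12;
set spark.executor.memory=5g;
SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
FROM sales s, times t, channels c
WHERE s.time_id = t.time_id
AND   s.channel_id = c.channel_id
GROUP BY t.calendar_month_desc, c.channel_desc;"
```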
>
> --Xuefu
>
>
>
> On Tue, Feb 2, 2016 at 5:08 PM, Mich Talebzadeh <mich@peridale.co.uk>
> wrote:
>
> Hi Jeff,
>
>
>
> In below
>
>
>
> …. You should be able to see the resource usage in the YARN resource
> manager URL.
>
>
>
> Just to be clear we are talking about Port 8088/cluster?
>
>
>
>
>
>
> *From:* Koert Kuipers [mailto:koert@tresata.com]
> *Sent:* 03 February 2016 00:09
>
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> uuuhm with spark using Hive metastore you actually have a real
> programming environment and you can write real functions, versus just being
> boxed into some version of sql and limited udfs?
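[Editor's note: a small sketch of Koert's point, assuming the Spark 1.x Python API (`sqlContext.registerFunction` on a HiveContext); the function itself is plain Python and runs without Spark. The function name and bucket thresholds are invented for illustration.]

```python
# Any plain Python function can back a SQL UDF once Spark is present.
def sales_bucket(amount_sold):
    """Classify a sale amount into a coarse size bucket."""
    if amount_sold < 100:
        return "small"
    if amount_sold < 1000:
        return "medium"
    return "large"

# With a HiveContext (Spark 1.x API, hedged sketch -- not run here):
#   sqlContext.registerFunction("sales_bucket", sales_bucket)
#   sqlContext.sql("SELECT sales_bucket(amount_sold) FROM sales").show()

if __name__ == "__main__":
    print(sales_bucket(42), sales_bucket(420), sales_bucket(4200))
```

In stock HiveQL the same logic would need a CASE expression or a compiled Java UDF, which is the boxing-in Koert refers to.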
>
>
>
> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzhang@cloudera.com> wrote:
>
> When comparing the performance, you need to do it apples to apples. In
> another thread, you mentioned that Hive on Spark is much slower than Spark
> SQL. However, you configured Hive such that only two tasks can run in
> parallel, and you didn't provide information on how many resources Spark
> SQL is utilizing. Thus, it's hard to tell whether it's just a configuration
> problem in your Hive setup or Spark SQL is indeed faster. You should be
> able to see the resource usage in the YARN resource manager URL.
>
> --Xuefu
>
>
>
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <mich@peridale.co.uk>
> wrote:
>
> Thanks Jeff.
>
>
>
> Obviously Hive is much more feature rich compared to Spark. Having said
> that in certain areas for example where the SQL feature is available in
> Spark, Spark seems to deliver faster.
>
>
>
> This may be:
>
>
>
> 1.    Spark does both the optimisation and the execution seamlessly
>
> 2.    Hive on Spark has to invoke YARN, which adds another layer to the
> process
>
>
>
> Now I did some simple tests on a 100-million-row ORC table available
> through Hive to both.
>
>
>
> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 100000);
>
> 1       0       0       63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1
> xxxxxxxxxx
>
> 5       0       4       31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5
> xxxxxxxxxx
>
> 100000  99      999     188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000
> xxxxxxxxxx
>
> Time taken: 50.805 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 100000);
>
> 1       0       0       63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1
> xxxxxxxxxx
>
> 5       0       4       31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5
> xxxxxxxxxx
>
> 100000  99      999     188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000
> xxxxxxxxxx
>
> Time taken: 50.358 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 100000);
>
> 1       0       0       63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1
> xxxxxxxxxx
>
> 5       0       4       31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5
> xxxxxxxxxx
>
> 100000  99      999     188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000
> xxxxxxxxxx
>
> Time taken: 50.563 seconds, Fetched 3 row(s)
>
>
>
> So three runs returning three rows just over 50 seconds
>
>
>
> *Hive 1.2.1 on spark 1.3.1 execution engine*
>
>
>
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1,
> 5, 100000);
>
> INFO  :
>
> Query Hive on Spark job[4] stages:
>
> INFO  : 4
>
> INFO  :
>
> Status: Running (Hive on Spark job[4])
>
> INFO  : Status: Finished successfully in 82.49 seconds
>
>
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
> |                 dummy.random_string                 | dummy.small_vc  |
> dummy.padding  |
>
>
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>
> | 1         | 0                | 0                | 63                |
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      |
> xxxxxxxxxx     |
>
> | 5         | 0                | 4                | 31                |
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      |
> xxxxxxxxxx     |
>
> | 100000    | 99               | 999              | 188               |
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      |
> xxxxxxxxxx     |
>
>
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>
> 3 rows selected (82.66 seconds)
>
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1,
> 5, 100000);
>
> INFO  : Status: Finished successfully in 76.67 seconds
>
>
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
> |                 dummy.random_string                 | dummy.small_vc  |
> dummy.padding  |
>
>
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>
> | 1         | 0                | 0                | 63                |
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      |
> xxxxxxxxxx     |
>
> | 5         | 0                | 4                | 31                |
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      |
> xxxxxxxxxx     |
>
> | 100000    | 99               | 999              | 188               |
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      |
> xxxxxxxxxx     |
>
>
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>
> 3 rows selected (76.835 seconds)
>
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1,
> 5, 100000);
>
> INFO  : Status: Finished successfully in 80.54 seconds
>
>
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
> |                 dummy.random_string                 | dummy.small_vc  |
> dummy.padding  |
>
>
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>
> | 1         | 0                | 0                | 63                |
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      |
> xxxxxxxxxx     |
>
> | 5         | 0                | 4                | 31                |
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      |
> xxxxxxxxxx     |
>
> | 100000    | 99               | 999              | 188               |
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      |
> xxxxxxxxxx     |
>
>
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>
> 3 rows selected (80.718 seconds)
>
>
>
> Three runs returning the same rows in 80 seconds.
>
>
>
> It is possible that my Spark engine with Hive is 1.3.1, which is out of
> date, and that causes this lag.
>
>
>
> There are certain queries that one cannot do with Spark. Besides, it does
> not recognize CHAR fields, which is a pain.
>
>
>
> spark-sql> *CREATE TEMPORARY TABLE tmp AS*
>
>          > SELECT t.calendar_month_desc, c.channel_desc,
> SUM(s.amount_sold) AS TotalSales
>
>          > FROM sales s, times t, channels c
>
>          > WHERE s.time_id = t.time_id
>
>          > AND   s.channel_id = c.channel_id
>
>          > GROUP BY t.calendar_month_desc, c.channel_desc
>
>          > ;
>
> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
>
> .
>
> You are likely trying to use an unsupported Hive feature.";
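[Editor's note: one hedged workaround, assuming Spark 1.2 or later: `CACHE TABLE ... AS SELECT` gives a named, cached result set where the TEMPORARY clause is rejected. Sketch only; the table name `tmp` mirrors the failed statement above.]

```sql
-- Sketch: materialise the aggregate under a session-scoped cached name.
CACHE TABLE tmp AS
SELECT t.calendar_month_desc, c.channel_desc,
       SUM(s.amount_sold) AS TotalSales
FROM sales s, times t, channels c
WHERE s.time_id = t.time_id
AND   s.channel_id = c.channel_id
GROUP BY t.calendar_month_desc, c.channel_desc;
```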
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From:* Xuefu Zhang [mailto:xzhang@cloudera.com]
> *Sent:* 02 February 2016 23:12
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> I think the diff is not only about which does optimization but more on
> feature parity. Hive on Spark offers all functional features that Hive
> offers and these features play out faster. However, Spark SQL is far from
> offering this parity as far as I know.
>
>
>
> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <mich@peridale.co.uk>
> wrote:
>
> Hi,
>
>
>
> My understanding is that with Hive on Spark engine, one gets the Hive
> optimizer and Spark query engine
>
>
>
> With spark using Hive metastore, Spark does both the optimization and
> query engine. The only value add is that one can access the underlying Hive
> tables from spark-sql etc
>
>
>
>
>
> Is this assessment correct?
>
>
>
>
>
>
>
> Thanks
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
