hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mich Talebzadeh" <m...@peridale.co.uk>
Subject RE: Hive on Spark Engine versus Spark using Hive metastore
Date Wed, 03 Feb 2016 10:12:12 GMT
Hi all,

 

 

Thanks for all the comments on this thread.

 

The question I put was simply to rectify technically the approaches with Spark on Hive metastore
and Hive using Spark engine.

 

The fact that we have both the benefits of Hive and Spark is tremendous. They both offer in
their own way many opportunities.

 

Hive is billed as a Data Warehouse (DW)  on HDFS. In that respect it does a good job. Among
many it offers many developers who are familiar with SQL to be productive immediacy. This
should not be underestimated. You can set up your copy of your RDBMS table in Hive in no time
and use Sqoop to get the table data into Hive table practically in one command. For many this
is the great attraction of Hive that can be summarised as: 

 

*         Leverage existing SQL skills on Big Data. 

*         You have a choice of metastore for Hive including MySql, Oracle, Sybase and others.


*         You have a choice of plug ins for your engine (MR, Spark, Tez)

*         Ability to do real time analytics on Hive by sending real time transactional movements
from RDBMS tables to Hive via the existing replication technologies. This is very useful

*         Use Sqoop to push data back to DW or RDBMS table

 

One can argue that in a DW the speed is not necessarily the overriding factor. It does not
matter whether a job finishes in two hours or 2.5 hours. Granted some commercial DW solutions
can do the job much faster but at what cost in terms of multiplexing and paying the licensing
fees. Hive is an attractive proposition here. 

 

I can live with most of Hive shortcomings but would like to see the following:

 

*         Hive has the ability to create multiple EXTERRNAL index types on columns. But they
are never used. It would be great if they can be incorporated in what they are supposed to
use. That will speed up processing

*         It will be awesome to have the ability to have some dialect of isql, PL/SQL capabilities
that allow local variables, conditional statements etc to be used in Hive much like other
DW without using Shell scripting, Pig and other tools 

 

 

Spark is great especially for those familiar with Scala and others languages (additional skill
set) that can leverage Spark shell. However, again it comes at a price of having available
memory which is not always the case. Point queries are great. However, if you bring back tons
of rows then the performance degrades as it has to spill to disk. 

 

Big Data space is getting crowded with a lot of products and auxiliary products. I can see
the potential of Spark for other exploratory work. Having said that in fairness, Hive as a
Data Warehouse  does what it says on the tin.

 

 

Thanks again

 

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.


co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This message is for the
designated recipient only, if you are not the intended recipient, you should destroy it immediately.
Any information in this message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility
of the recipient to ensure that this email is virus free, therefore neither Peridale Technology
Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Mich Talebzadeh [mailto:mich@peridale.co.uk] 
Sent: 03 February 2016 09:25
To: user@hive.apache.org
Subject: RE: Hive on Spark Engine versus Spark using Hive metastore

 

Hi Jeff,

 

I only have a two node cluster. Is there anyway one can simulate additional parallel runs
in such an environment thus having more than two maps?

 

thanks

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.


co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This message is for the
designated recipient only, if you are not the intended recipient, you should destroy it immediately.
Any information in this message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility
of the recipient to ensure that this email is virus free, therefore neither Peridale Technology
Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Xuefu Zhang [mailto:xzhang@cloudera.com] 
Sent: 03 February 2016 02:39
To: user@hive.apache.org <mailto:user@hive.apache.org> 
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

 

Yes, regardless what spark mode you're running in, from Spark AM webui, you should be able
to see how many task are concurrently running. I'm a little surprised to see that your Hive
configuration only allows 2 map tasks to run in parallel. If your cluster has the capacity,
you should parallelize all the tasks to achieve optimal performance. Since I don't know your
Spark SQL configuration, I cannot tell how much parallelism you have over there. Thus, I'm
not sure if your comparison is valid.

--Xuefu

 

On Tue, Feb 2, 2016 at 5:08 PM, Mich Talebzadeh <mich@peridale.co.uk <mailto:mich@peridale.co.uk>
> wrote:

Hi Jeff,

 

In below

 

…. You should be able to see the resource usage in YARN resource manage URL.

 

Just to be clear we are talking about Port 8088/cluster?

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.


co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This message is for the
designated recipient only, if you are not the intended recipient, you should destroy it immediately.
Any information in this message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility
of the recipient to ensure that this email is virus free, therefore neither Peridale Technology
Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Koert Kuipers [mailto:koert@tresata.com <mailto:koert@tresata.com> ] 
Sent: 03 February 2016 00:09

To: user@hive.apache.org <mailto:user@hive.apache.org> 
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

 

uuuhm with spark using Hive metastore you actually have a real programming environment and
you can write real functions, versus just being boxed into some version of sql and limited
udfs?

 

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzhang@cloudera.com <mailto:xzhang@cloudera.com>
> wrote:

When comparing the performance, you need to do it apple vs apple. In another thread, you mentioned
that Hive on Spark is much slower than Spark SQL. However, you configured Hive such that only
two tasks can run in parallel. However, you didn't provide information on how much Spark SQL
is utilizing. Thus, it's hard to tell whether it's just a configuration problem in your Hive
or Spark SQL is indeed faster. You should be able to see the resource usage in YARN resource
manage URL.

--Xuefu

 

On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <mich@peridale.co.uk <mailto:mich@peridale.co.uk>
> wrote:

Thanks Jeff.

 

Obviously Hive is much more feature rich compared to Spark. Having said that in certain areas
for example where the SQL feature is available in Spark, Spark seems to deliver faster.

 

This may be:

 

1.    Spark does both the optimisation and execution seamlessly

2.    Hive on Spark has to invoke YARN that adds another layer to the process

 

Now I did some simple tests on a 100Million rows ORC table available through Hive to both.

 

Spark 1.5.2 on Hive 1.2.1 Metastore

 

 

spark-sql> select * from dummy where id in (1, 5, 100000);

1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi       
       1      xxxxxxxxxx

5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA       
       5      xxxxxxxxxx

100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe       
  100000      xxxxxxxxxx

Time taken: 50.805 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 100000);

1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi       
       1      xxxxxxxxxx

5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA       
       5      xxxxxxxxxx

100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe       
  100000      xxxxxxxxxx

Time taken: 50.358 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 100000);

1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi       
       1      xxxxxxxxxx

5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA       
       5      xxxxxxxxxx

100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe       
  100000      xxxxxxxxxx

Time taken: 50.563 seconds, Fetched 3 row(s)

 

So three runs returning three rows just over 50 seconds

 

Hive 1.2.1 on spark 1.3.1 execution engine

 

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);

INFO  :

Query Hive on Spark job[4] stages:

INFO  : 4

INFO  :

Status: Running (Hive on Spark job[4])

INFO  : Status: Finished successfully in 82.49 seconds

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

| dummy.id <http://dummy.id>   | dummy.clustered  | dummy.scattered  | dummy.randomised
 |                 dummy.random_string                 | dummy.small_vc  | dummy.padding 
|

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

| 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi
 |          1      | xxxxxxxxxx     |

| 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA
 |          5      | xxxxxxxxxx     |

| 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe
 |     100000      | xxxxxxxxxx     |

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

3 rows selected (82.66 seconds)

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);

INFO  : Status: Finished successfully in 76.67 seconds

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

| dummy.id <http://dummy.id>   | dummy.clustered  | dummy.scattered  | dummy.randomised
 |                 dummy.random_string                 | dummy.small_vc  | dummy.padding 
|

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

| 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi
 |          1      | xxxxxxxxxx     |

| 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA
 |          5      | xxxxxxxxxx     |

| 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe
 |     100000      | xxxxxxxxxx     |

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

3 rows selected (76.835 seconds)

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);

INFO  : Status: Finished successfully in 80.54 seconds

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

| dummy.id <http://dummy.id>   | dummy.clustered  | dummy.scattered  | dummy.randomised
 |                 dummy.random_string                 | dummy.small_vc  | dummy.padding 
|

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

| 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi
 |          1      | xxxxxxxxxx     |

| 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA
 |          5      | xxxxxxxxxx     |

| 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe
 |     100000      | xxxxxxxxxx     |

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

3 rows selected (80.718 seconds)

 

Three runs returning the same rows in 80 seconds. 

 

It is possible that My Spark engine with Hive is 1.3.1 which is out of date and that causes
this lag. 

 

There are certain queries that one cannot do with Spark. Besides it does not recognize CHAR
fields which is a pain.

 

spark-sql> CREATE TEMPORARY TABLE tmp AS

         > SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales

         > FROM sales s, times t, channels c

         > WHERE s.time_id = t.time_id

         > AND   s.channel_id = c.channel_id

         > GROUP BY t.calendar_month_desc, c.channel_desc

         > ;

Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7

.

You are likely trying to use an unsupported Hive feature.";

 

 

 

 

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.


co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This message is for the
designated recipient only, if you are not the intended recipient, you should destroy it immediately.
Any information in this message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility
of the recipient to ensure that this email is virus free, therefore neither Peridale Technology
Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Xuefu Zhang [mailto:xzhang@cloudera.com <mailto:xzhang@cloudera.com> ] 
Sent: 02 February 2016 23:12
To: user@hive.apache.org <mailto:user@hive.apache.org> 
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

 

I think the diff is not only about which does optimization but more on feature parity. Hive
on Spark offers all functional features that Hive offers and these features play out faster.
However, Spark SQL is far from offering this parity as far as I know.

 

On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <mich@peridale.co.uk <mailto:mich@peridale.co.uk>
> wrote:

Hi,

 

My understanding is that with Hive on Spark engine, one gets the Hive optimizer and Spark
query engine

 

With spark using Hive metastore, Spark does both the optimization and query engine. The only
value add is that one can access the underlying Hive tables from spark-sql etc

 

 

Is this assessment correct?

 

 

 

Thanks

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.


co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This message is for the
designated recipient only, if you are not the intended recipient, you should destroy it immediately.
Any information in this message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility
of the recipient to ensure that this email is virus free, therefore neither Peridale Technology
Ltd, its subsidiaries nor their employees accept any responsibility.

 

 

 

 

 


Mime
View raw message