hive-user mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: Hive on Spark Engine versus Spark using Hive metastore
Date Wed, 03 Feb 2016 04:49:37 GMT
yeah but have you ever seen someone write a real analytical program in
hive? how? where are the basic abstractions to wrap up a large number of
operations (joins, groupby's) into a single function call? where are the
tools to write nice unit tests for that?

for example in spark i can write a DataFrame => DataFrame that internally
does many joins, groupBys and complex operations. all unit tested and
perfectly re-usable. and in hive? copy-paste sql queries around? that's just
dangerous.
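
As a rough illustration of the point above (not code from this thread), here is a minimal Scala sketch of such a reusable function. The table and column names are borrowed from the sales query quoted further down; the object name `SalesReports` and the function name `salesByMonthAndChannel` are made up for this example, and it assumes Spark 1.4 or later so that `join` accepts a column name.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

// Hypothetical helper object, named for illustration only.
object SalesReports {

  // Fixes the two dimension tables and returns a DataFrame => DataFrame
  // that can be applied to any sales DataFrame with the expected columns
  // (time_id, channel_id, amount_sold).
  def salesByMonthAndChannel(times: DataFrame, channels: DataFrame): DataFrame => DataFrame =
    sales =>
      sales
        .join(times, "time_id")          // equi-join on the shared column name
        .join(channels, "channel_id")
        .groupBy("calendar_month_desc", "channel_desc")
        .agg(sum("amount_sold").as("TotalSales"))
}
```

Because the function only takes and returns DataFrames, a unit test can feed it a few hand-built rows from a local SQLContext and assert on the collected result, without touching the Hive metastore at all.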

On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo <edlinuxguru@gmail.com>
wrote:

> Hive has numerous extension points, you are not boxed in by a long shot.
>
>
> On Tuesday, February 2, 2016, Koert Kuipers <koert@tresata.com> wrote:
>
>> uuuhm with spark using Hive metastore you actually have a real
>> programming environment and you can write real functions, versus just being
>> boxed into some version of sql and limited udfs?
>>
>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzhang@cloudera.com> wrote:
>>
>>> When comparing the performance, you need to compare apples to apples. In
>>> another thread, you mentioned that Hive on Spark is much slower than Spark
>>> SQL. However, you configured Hive such that only two tasks can run in
>>> parallel, and you didn't provide information on how many resources Spark
>>> SQL is utilizing. Thus, it's hard to tell whether it's just a configuration
>>> problem in your Hive setup or whether Spark SQL is indeed faster. You
>>> should be able to see the resource usage at the YARN ResourceManager URL.
>>>
>>> --Xuefu
>>>
>>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <mich@peridale.co.uk>
>>> wrote:
>>>
>>>> Thanks Jeff.
>>>>
>>>>
>>>>
>>>> Obviously Hive is much more feature rich compared to Spark. Having said
>>>> that, in certain areas, for example where the SQL feature is available in
>>>> Spark, Spark seems to deliver faster.
>>>>
>>>>
>>>>
>>>> This may be:
>>>>
>>>>
>>>>
>>>> 1.    Spark does both the optimisation and execution seamlessly
>>>>
>>>> 2.    Hive on Spark has to invoke YARN, which adds another layer to the
>>>> process
>>>>
>>>>
>>>>
>>>> Now I did some simple tests on a 100-million-row ORC table available
>>>> through Hive to both.
>>>>
>>>>
>>>>
>>>> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>
>>>> 1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1       xxxxxxxxxx
>>>>
>>>> 5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5       xxxxxxxxxx
>>>>
>>>> 100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000       xxxxxxxxxx
>>>>
>>>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>>>
>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>
>>>> 1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1       xxxxxxxxxx
>>>>
>>>> 5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5       xxxxxxxxxx
>>>>
>>>> 100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000       xxxxxxxxxx
>>>>
>>>> Time taken: 50.358 seconds, Fetched 3 row(s)
>>>>
>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>
>>>> 1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1       xxxxxxxxxx
>>>>
>>>> 5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5       xxxxxxxxxx
>>>>
>>>> 100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000       xxxxxxxxxx
>>>>
>>>> Time taken: 50.563 seconds, Fetched 3 row(s)
>>>>
>>>>
>>>>
>>>> So three runs, each returning three rows in just over 50 seconds.
>>>>
>>>>
>>>>
>>>> *Hive 1.2.1 on spark 1.3.1 execution engine*
>>>>
>>>>
>>>>
>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>>>> (1, 5, 100000);
>>>>
>>>> INFO  :
>>>>
>>>> Query Hive on Spark job[4] stages:
>>>>
>>>> INFO  : 4
>>>>
>>>> INFO  :
>>>>
>>>> Status: Running (Hive on Spark job[4])
>>>>
>>>> INFO  : Status: Finished successfully in 82.49 seconds
>>>>
>>>>
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
>>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
>>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>
>>>> 3 rows selected (82.66 seconds)
>>>>
>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>>>> (1, 5, 100000);
>>>>
>>>> INFO  : Status: Finished successfully in 76.67 seconds
>>>>
>>>>
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
>>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
>>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>
>>>> 3 rows selected (76.835 seconds)
>>>>
>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>>>> (1, 5, 100000);
>>>>
>>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>>
>>>>
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
>>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
>>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>
>>>> 3 rows selected (80.718 seconds)
>>>>
>>>>
>>>>
>>>> Three runs returning the same rows in roughly 77 to 83 seconds each.
>>>>
>>>>
>>>>
>>>> It is possible that the lag is caused by the Spark engine I use with Hive
>>>> being 1.3.1, which is out of date.
>>>>
>>>>
>>>>
>>>> There are certain queries that one cannot run with Spark. Besides, it
>>>> does not recognize CHAR fields, which is a pain.
>>>>
>>>>
>>>>
>>>> spark-sql> *CREATE TEMPORARY TABLE tmp AS*
>>>>
>>>>          > SELECT t.calendar_month_desc, c.channel_desc,
>>>> SUM(s.amount_sold) AS TotalSales
>>>>
>>>>          > FROM sales s, times t, channels c
>>>>
>>>>          > WHERE s.time_id = t.time_id
>>>>
>>>>          > AND   s.channel_id = c.channel_id
>>>>
>>>>          > GROUP BY t.calendar_month_desc, c.channel_desc
>>>>
>>>>          > ;
>>>>
>>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
>>>>
>>>> .
>>>>
>>>> You are likely trying to use an unsupported Hive feature.";
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>>
>>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>>
>>>>
>>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>>
>>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase
>>>> ASE 15", ISBN 978-0-9563693-0-7*.
>>>>
>>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>>> 978-0-9759693-0-4*
>>>>
>>>> *Publications due shortly:*
>>>>
>>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>>> 978-0-9563693-3-8
>>>>
>>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>>>> one out shortly
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> NOTE: The information in this email is proprietary and confidential.
>>>> This message is for the designated recipient only, if you are not the
>>>> intended recipient, you should destroy it immediately. Any information in
>>>> this message shall not be understood as given or endorsed by Peridale
>>>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>>>> stated. It is the responsibility of the recipient to ensure that this email
>>>> is virus free, therefore neither Peridale Technology Ltd, its subsidiaries
>>>> nor their employees accept any responsibility.
>>>>
>>>>
>>>>
>>>> *From:* Xuefu Zhang [mailto:xzhang@cloudera.com]
>>>> *Sent:* 02 February 2016 23:12
>>>> *To:* user@hive.apache.org
>>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>>
>>>>
>>>>
>>>> I think the diff is not only about which does optimization but more on
>>>> feature parity. Hive on Spark offers all functional features that Hive
>>>> offers and these features play out faster. However, Spark SQL is far from
>>>> offering this parity as far as I know.
>>>>
>>>>
>>>>
>>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <mich@peridale.co.uk>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> My understanding is that with the Hive on Spark engine, one gets the Hive
>>>> optimizer and the Spark query engine.
>>>>
>>>>
>>>>
>>>> With Spark using the Hive metastore, Spark provides both the optimizer and
>>>> the query engine. The only value add is that one can access the underlying
>>>> Hive tables from spark-sql etc.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Is this assessment correct?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
> --
> Sorry this was sent from mobile. Will do less grammar and spell check than
> usual.
>
