hive-user mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: Hive on Spark Engine versus Spark using Hive metastore
Date Wed, 03 Feb 2016 14:51:44 GMT
1) spark bundles hive-metastore and hive-exec to get access to the
metastore and serdes. and i am pretty sure they would like to reduce this
if they could given the kitchensink of dependencies that hive is, but that
is not easy since hive was never written as re-usable java libraries. i
imagine that ideally spark would use hcatalog.
2) i don't know much about the catalyst sauce... i do think scala lends itself
somewhat better than java to writing such a thing. tez is interesting to me
as well, but again i would avoid hive, since there is more interesting stuff
to do in the world than ETL and data warehousing. scalding on tez would be
my choice.


On Wed, Feb 3, 2016 at 9:27 AM, Edward Capriolo <edlinuxguru@gmail.com>
wrote:

> Thank you for the speech. There is an infinite list of things hive does
> not do/can't do well.
> There is an infinite list of things spark does not do/can't do well.
>
> Some facts:
> 1) spark has a complete fork of hive inside it. So it's hard to trash hive
> without at least noting the fact that it's a portion of spark's guts.
> 2) there were lots of people touting benchmarks about spark sql beating
> hive, lots of fud about catalyst awesome sauce. But then it seems like hive
> and tez made spark say uncle...
>
> https://www.slideshare.net/mobile/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
>
>
> On Wednesday, February 3, 2016, Koert Kuipers <koert@tresata.com> wrote:
>
>> ok i am sure there is some way to do it. i am going to guess snippets of
>> hive code stuck together with oozie jobs or whatever. the oozie jobs become
>> the re-usable pieces perhaps? now you've got sql and xml, completely lacking
>> any benefits of a compiler to catch errors. unit tests will be slow, if
>> available at all. so yeah, i am sure it can be made to *work*. just like you
>> can get a nail into a wall with a screwdriver if you really want.
>>
>> On Tue, Feb 2, 2016 at 11:49 PM, Koert Kuipers <koert@tresata.com> wrote:
>>
>>> yeah but have you ever seen someone write a real analytical program in
>>> hive? how? where are the basic abstractions to wrap up a large number of
>>> operations (joins, groupBys) into a single function call? where are the
>>> tools to write nice unit tests for that?
>>>
>>> for example in spark i can write a DataFrame => DataFrame function that
>>> internally does many joins, groupBys and complex operations. all unit
>>> tested and perfectly re-usable. and in hive? copy-pasting sql queries
>>> around? that's just dangerous.
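
A minimal sketch of the idea of wrapping joins and a group-by into one
reusable, unit-testable function. It is written in plain Python over lists of
dicts as a stand-in for a Spark DataFrame => DataFrame function, so nothing
here is Spark API. The function name is hypothetical, the column names are
borrowed from the sales query quoted further down in this thread, and time_id
and channel_id are assumed to be unique in their lookup tables.

```python
from collections import defaultdict


def total_sales_by_month_and_channel(sales, times, channels):
    """Join sales to times and channels, then sum amount_sold per (month, channel).

    Hypothetical helper: each argument is a list of dicts standing in for a table.
    """
    # Build lookup maps for the two dimension tables (assumes unique keys).
    month_by_time_id = {t["time_id"]: t["calendar_month_desc"] for t in times}
    channel_by_id = {c["channel_id"]: c["channel_desc"] for c in channels}

    totals = defaultdict(float)
    for s in sales:
        # Inner-join semantics: skip sales rows without a matching time or channel.
        if s["time_id"] in month_by_time_id and s["channel_id"] in channel_by_id:
            key = (month_by_time_id[s["time_id"]], channel_by_id[s["channel_id"]])
            totals[key] += s["amount_sold"]
    return dict(totals)


def test_total_sales_by_month_and_channel():
    # Tiny made-up fixture, used only to exercise the function.
    times = [{"time_id": 1, "calendar_month_desc": "2016-01"}]
    channels = [{"channel_id": 10, "channel_desc": "Internet"}]
    sales = [
        {"time_id": 1, "channel_id": 10, "amount_sold": 10.0},
        {"time_id": 1, "channel_id": 10, "amount_sold": 5.0},
        {"time_id": 2, "channel_id": 10, "amount_sold": 2.5},  # no matching time row
    ]
    result = total_sales_by_month_and_channel(sales, times, channels)
    assert result == {("2016-01", "Internet"): 15.0}


test_total_sales_by_month_and_channel()
```

The point of the sketch is only the shape: the multi-step logic lives behind
one function call and a small fixture checks it in milliseconds.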
>>>
>>> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo <edlinuxguru@gmail.com>
>>> wrote:
>>>
>>>> Hive has numerous extension points, you are not boxed in by a long shot.
>>>>
>>>>
>>>> On Tuesday, February 2, 2016, Koert Kuipers <koert@tresata.com> wrote:
>>>>
>>>>> uuuhm with spark using Hive metastore you actually have a real
>>>>> programming environment and you can write real functions, versus just being
>>>>> boxed into some version of sql and limited udfs?
>>>>>
>>>>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzhang@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> When comparing the performance, you need to compare apples to apples. In
>>>>>> another thread, you mentioned that Hive on Spark is much slower than Spark
>>>>>> SQL. However, you configured Hive such that only two tasks can run in
>>>>>> parallel, and you didn't provide information on how many resources Spark
>>>>>> SQL is using. Thus, it's hard to tell whether it's just a configuration
>>>>>> problem in your Hive or whether Spark SQL is indeed faster. You should be
>>>>>> able to see the resource usage at the YARN resource manager URL.
>>>>>>
>>>>>> --Xuefu
>>>>>>
>>>>>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <mich@peridale.co.uk>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Jeff.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Obviously Hive is much more feature rich compared to Spark. Having
>>>>>>> said that, in areas where the SQL feature is available in Spark,
>>>>>>> Spark seems to deliver faster.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This may be:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 1.    Spark does both the optimisation and execution seamlessly
>>>>>>>
>>>>>>> 2.    Hive on Spark has to invoke YARN, which adds another layer to
>>>>>>> the process
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Now I did some simple tests on a 100 million row ORC table available
>>>>>>> through Hive to both.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>>>> 1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi          1       xxxxxxxxxx
>>>>>>> 5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA          5       xxxxxxxxxx
>>>>>>> 100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000  xxxxxxxxxx
>>>>>>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>>>>>>
>>>>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>>>> 1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi          1       xxxxxxxxxx
>>>>>>> 5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA          5       xxxxxxxxxx
>>>>>>> 100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000  xxxxxxxxxx
>>>>>>> Time taken: 50.358 seconds, Fetched 3 row(s)
>>>>>>>
>>>>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>>>> 1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi          1       xxxxxxxxxx
>>>>>>> 5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA          5       xxxxxxxxxx
>>>>>>> 100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000  xxxxxxxxxx
>>>>>>> Time taken: 50.563 seconds, Fetched 3 row(s)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> So three runs, each returning three rows in just over 50 seconds.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Hive 1.2.1 on spark 1.3.1 execution engine*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>>>>>>> INFO  :
>>>>>>> Query Hive on Spark job[4] stages:
>>>>>>> INFO  : 4
>>>>>>> INFO  :
>>>>>>> Status: Running (Hive on Spark job[4])
>>>>>>> INFO  : Status: Finished successfully in 82.49 seconds
>>>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
>>>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
>>>>>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
>>>>>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
>>>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>>>> 3 rows selected (82.66 seconds)
>>>>>>>
>>>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>>>>>>> INFO  : Status: Finished successfully in 76.67 seconds
>>>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
>>>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
>>>>>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
>>>>>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
>>>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>>>> 3 rows selected (76.835 seconds)
>>>>>>>
>>>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>>>>>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
>>>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
>>>>>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
>>>>>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
>>>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>>>> 3 rows selected (80.718 seconds)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Three runs returning the same rows in about 80 seconds each.
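
As a quick check on the two sets of timings quoted above, the arithmetic can
be sketched in a few lines of Python. The numbers are copied from the
"Time taken" and "rows selected" lines of the six runs; the variable names are
only illustrative, and three runs per engine is too few to say anything about
variance.

```python
from statistics import mean

# Wall-clock times in seconds, copied from the runs quoted above.
spark_sql_runs = [50.805, 50.358, 50.563]     # Spark 1.5.2 on Hive 1.2.1 metastore
hive_on_spark_runs = [82.66, 76.835, 80.718]  # Hive 1.2.1 on Spark 1.3.1 engine

spark_sql_mean = mean(spark_sql_runs)
hive_on_spark_mean = mean(hive_on_spark_runs)

# Ratio of the mean elapsed times (Hive on Spark relative to Spark SQL).
slowdown = hive_on_spark_mean / spark_sql_mean

print(f"Spark SQL mean:     {spark_sql_mean:.2f} s")
print(f"Hive on Spark mean: {hive_on_spark_mean:.2f} s")
print(f"Ratio:              {slowdown:.2f}x")
```

This only summarises the elapsed times as reported; it says nothing about how
many executors or cores each engine was given.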
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> It is possible that my Spark engine with Hive is 1.3.1, which is out
>>>>>>> of date, and that this causes the lag.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> There are certain queries that one cannot do with Spark. Besides, it
>>>>>>> does not recognize CHAR fields, which is a pain.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> spark-sql> *CREATE TEMPORARY TABLE tmp AS*
>>>>>>>          > SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
>>>>>>>          > FROM sales s, times t, channels c
>>>>>>>          > WHERE s.time_id = t.time_id
>>>>>>>          > AND   s.channel_id = c.channel_id
>>>>>>>          > GROUP BY t.calendar_month_desc, c.channel_desc
>>>>>>>          > ;
>>>>>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
>>>>>>> .
>>>>>>> You are likely trying to use an unsupported Hive feature.";
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>>>>>
>>>>>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>>>>>
>>>>>>>
>>>>>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>>>>>
>>>>>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase
>>>>>>> ASE 15", ISBN 978-0-9563693-0-7*.
>>>>>>>
>>>>>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>>>>>> 978-0-9759693-0-4*
>>>>>>>
>>>>>>> *Publications due shortly:*
>>>>>>>
>>>>>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>>>>>> 978-0-9563693-3-8
>>>>>>>
>>>>>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN:
>>>>>>> 978-0-9563693-1-4, volume one out shortly
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> NOTE: The information in this email is proprietary and confidential.
>>>>>>> This message is for the designated recipient only, if you are not the
>>>>>>> intended recipient, you should destroy it immediately. Any information in
>>>>>>> this message shall not be understood as given or endorsed by Peridale
>>>>>>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>>>>>>> stated. It is the responsibility of the recipient to ensure that this email
>>>>>>> is virus free, therefore neither Peridale Technology Ltd, its subsidiaries
>>>>>>> nor their employees accept any responsibility.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Xuefu Zhang [mailto:xzhang@cloudera.com]
>>>>>>> *Sent:* 02 February 2016 23:12
>>>>>>> *To:* user@hive.apache.org
>>>>>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive
>>>>>>> metastore
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I think the diff is not only about which does optimization but more
>>>>>>> on feature parity. Hive on Spark offers all functional features that Hive
>>>>>>> offers and these features play out faster. However, Spark SQL is far from
>>>>>>> offering this parity as far as I know.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <mich@peridale.co.uk>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> My understanding is that with Hive on Spark engine, one gets the
>>>>>>> Hive optimizer and Spark query engine
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> With spark using Hive metastore, Spark does both the optimization
>>>>>>> and query engine. The only value add is that one can access the underlying
>>>>>>> Hive tables from spark-sql etc
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Is this assessment correct?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Sorry this was sent from mobile. Will do less grammar and spell check
>>>> than usual.
>>>>
>>>
>>>
>>
>
>
