hive-user mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: Hive on Spark Engine versus Spark using Hive metastore
Date Thu, 04 Feb 2016 17:50:30 GMT
fair enough

On Thu, Feb 4, 2016 at 12:41 PM, Edward Capriolo <edlinuxguru@gmail.com>
wrote:

> Hive is not the correct tool for every problem. Use the tool that makes
> the most sense for your problem and your experience.
>
> Many people like Hive because it is generally applicable. In my case study
> for the Hive book I highlighted that many smart, capable organizations use Hive.
>
> Your argument is totally valid. You like X better because X works for you.
> You don't need to 'preach' here; we all know Hive has its limits.
>
> On Thu, Feb 4, 2016 at 10:55 AM, Koert Kuipers <koert@tresata.com> wrote:
>
>> Is the sky the limit? I know UDFs can be used inside Hive, like lambdas
>> basically, I assume, and I will assume you have something similar for
>> aggregations. But those are just abstractions inside a single map or reduce
>> phase, pretty low-level stuff. What you really need is abstractions around
>> many map and reduce phases, because that is the level an algo is expressed
>> at.
>>
>> For example when doing logistic regression you want to be able to do
>> something like:
>> read("somefile").train(settings).write("model")
>> Here train is an externally defined method that is well tested and could
>> do many map and reduce steps internally (or even be defined at a higher
>> level and compile into those steps). What is the equivalent in Hive? Copy-
>> pasting crucial parts of the algo around while using UDFs is just not the
>> same thing in terms of reusability and abstraction. It's the opposite of
>> keeping it DRY.
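
To make the shape of that call concrete, here is a minimal single-machine sketch in plain Python (not Spark or Hive code). The names `Dataset`, `Model`, `read`, the CSV layout, and the `iterations` and `learning_rate` settings are all assumptions made for illustration. The point is only that `train` hides many full passes over the data behind one well-tested method.

```python
import csv
import json
import math


class Model:
    """Fitted weights; the last entry is the bias term."""

    def __init__(self, weights):
        self.weights = weights

    def predict(self, features):
        # Logistic function applied to the weighted sum of features plus bias.
        z = sum(w * x for w, x in zip(self.weights, features + [1.0]))
        return 1.0 / (1.0 + math.exp(-z))

    def write(self, path):
        with open(path, "w") as f:
            json.dump({"weights": self.weights}, f)
        return self


class Dataset:
    """Rows of (features, label); a local stand-in for a distributed dataset."""

    def __init__(self, rows):
        self.rows = rows

    def train(self, iterations=200, learning_rate=0.5):
        # Each iteration is one full pass over the data: on a cluster this
        # would be one map step (per-row gradient) and one reduce step (sum).
        width = len(self.rows[0][0]) + 1
        model = Model([0.0] * width)
        for _ in range(iterations):
            gradient = [0.0] * width
            for features, label in self.rows:
                error = model.predict(features) - label
                for i, x in enumerate(features + [1.0]):
                    gradient[i] += error * x
            model.weights = [
                w - learning_rate * g / len(self.rows)
                for w, g in zip(model.weights, gradient)
            ]
        return model


def read(path):
    """Load CSV rows whose last column is the 0/1 label (assumed layout)."""
    with open(path, newline="") as f:
        parsed = [[float(v) for v in row] for row in csv.reader(f) if row]
    return Dataset([(row[:-1], row[-1]) for row in parsed])
```

With these definitions the one-liner becomes `read("somefile").train(iterations=200).write("model")`, and `train` can be unit tested once and reused everywhere.
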
>> On Feb 3, 2016 1:06 AM, "Ryan Harris" <Ryan.Harris@zionsbancorp.com>
>> wrote:
>>
>>> https://github.com/myui/hivemall
>>>
>>>
>>>
>>> as long as you are comfortable with java UDFs, the sky is really the
>>> limit...it's not for everyone and spark does have many advantages, but they
>>> are two tools that can complement each other in numerous ways.
>>>
>>>
>>>
>>> I don't know that there is necessarily a universal "better" for how to
>>> use spark as an execution engine (or if spark is necessarily the **best**
>>> execution engine for any given hive job).
>>>
>>>
>>>
>>> The reality is that once you start factoring in the numerous tuning
>>> parameters of the systems and jobs there probably isn't a clear answer.
>>> For some queries, the Catalyst optimizer may do a better job...is it going
>>> to do a better job with ORC based data? less likely IMO.
>>>
>>>
>>>
>>> *From:* Koert Kuipers [mailto:koert@tresata.com]
>>> *Sent:* Tuesday, February 02, 2016 9:50 PM
>>> *To:* user@hive.apache.org
>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>
>>>
>>>
>>> yeah but have you ever seen someone write a real analytical program in
>>> hive? how? where are the basic abstractions to wrap up a large number of
>>> operations (joins, groupBys) into a single function call? where are the
>>> tools to write nice unit tests for that?
>>>
>>> for example in spark i can write a DataFrame => DataFrame function that
>>> internally does many joins, groupBys and complex operations. all unit
>>> tested and perfectly re-usable. and in hive? copy-pasting sql queries around?
>>> that's just dangerous.
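
As a rough illustration of that idea, the sketch below wraps two joins and a group-by in one reusable, unit-tested function. It uses plain Python lists of dicts as a stand-in for a DataFrame, so it is not the Spark API. The helper names `join` and `group_sum` are hypothetical, and the table and column names are borrowed from the sales query quoted further down the thread.

```python
from collections import defaultdict


def join(left, right, key):
    """Inner-join two lists of dict rows on a shared key column."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def group_sum(rows, keys, value):
    """Group rows by the key columns and sum one value column."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[value]
    return [dict(zip(keys, k), total=v) for k, v in sorted(totals.items())]


def sales_by_month_and_channel(sales, times, channels):
    """One reusable call that hides two joins and a group-by."""
    joined = join(join(sales, times, "time_id"), channels, "channel_id")
    return group_sum(joined, ["calendar_month_desc", "channel_desc"], "amount_sold")


def test_sales_by_month_and_channel():
    # Tiny hand-made inputs make the expected totals easy to check by eye.
    sales = [
        {"time_id": 1, "channel_id": 10, "amount_sold": 5.0},
        {"time_id": 1, "channel_id": 10, "amount_sold": 7.0},
        {"time_id": 2, "channel_id": 20, "amount_sold": 1.5},
    ]
    times = [
        {"time_id": 1, "calendar_month_desc": "2016-01"},
        {"time_id": 2, "calendar_month_desc": "2016-02"},
    ]
    channels = [
        {"channel_id": 10, "channel_desc": "Internet"},
        {"channel_id": 20, "channel_desc": "Direct"},
    ]
    assert sales_by_month_and_channel(sales, times, channels) == [
        {"calendar_month_desc": "2016-01", "channel_desc": "Internet", "total": 12.0},
        {"calendar_month_desc": "2016-02", "channel_desc": "Direct", "total": 1.5},
    ]
```

Callers only see `sales_by_month_and_channel`; the joins and the aggregation can change internally without any query text being copied around.
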
>>>
>>>
>>>
>>> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo <edlinuxguru@gmail.com>
>>> wrote:
>>>
>>> Hive has numerous extension points, you are not boxed in by a long shot.
>>>
>>>
>>>
>>> On Tuesday, February 2, 2016, Koert Kuipers <koert@tresata.com> wrote:
>>>
>>> uuuhm with spark using Hive metastore you actually have a real
>>> programming environment and you can write real functions, versus just being
>>> boxed into some version of sql and limited udfs?
>>>
>>>
>>>
>>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzhang@cloudera.com> wrote:
>>>
>>> When comparing performance, you need to compare apples to apples. In
>>> another thread, you mentioned that Hive on Spark is much slower than Spark
>>> SQL. However, you configured Hive such that only two tasks can run in
>>> parallel, and you didn't provide information on how many resources Spark SQL
>>> is using. Thus, it's hard to tell whether it's just a configuration
>>> problem in your Hive setup or whether Spark SQL is indeed faster. You should
>>> be able to see the resource usage at the YARN ResourceManager URL.
>>>
>>> --Xuefu
>>>
>>>
>>>
>>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <mich@peridale.co.uk>
>>> wrote:
>>>
>>> Thanks Jeff.
>>>
>>>
>>>
>>> Obviously Hive is much more feature-rich than Spark. Having said
>>> that, in certain areas, for example where the SQL feature is available in
>>> Spark, Spark seems to deliver faster.
>>>
>>> This may be because:
>>>
>>> 1.    Spark does both the optimisation and execution seamlessly
>>>
>>> 2.    Hive on Spark has to invoke YARN, which adds another layer to the
>>> process
>>>
>>>
>>>
>>> Now I did some simple tests on a 100 million row ORC table available
>>> through Hive to both.
>>>
>>>
>>>
>>> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>>>
>>>
>>>
>>>
>>>
>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>
>>> 1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1       xxxxxxxxxx
>>>
>>> 5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5       xxxxxxxxxx
>>>
>>> 100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000       xxxxxxxxxx
>>>
>>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>>
>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>
>>> 1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1       xxxxxxxxxx
>>>
>>> 5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5       xxxxxxxxxx
>>>
>>> 100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000       xxxxxxxxxx
>>>
>>> Time taken: 50.358 seconds, Fetched 3 row(s)
>>>
>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>
>>> 1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1       xxxxxxxxxx
>>>
>>> 5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5       xxxxxxxxxx
>>>
>>> 100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000       xxxxxxxxxx
>>>
>>> Time taken: 50.563 seconds, Fetched 3 row(s)
>>>
>>>
>>>
>>> So three runs returning three rows in just over 50 seconds each.
>>>
>>>
>>>
>>> *Hive 1.2.1 on spark 1.3.1 execution engine*
>>>
>>>
>>>
>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>>> (1, 5, 100000);
>>>
>>> INFO  :
>>>
>>> Query Hive on Spark job[4] stages:
>>>
>>> INFO  : 4
>>>
>>> INFO  :
>>>
>>> Status: Running (Hive on Spark job[4])
>>>
>>> INFO  : Status: Finished successfully in 82.49 seconds
>>>
>>>
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>
>>> 3 rows selected (82.66 seconds)
>>>
>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>>> (1, 5, 100000);
>>>
>>> INFO  : Status: Finished successfully in 76.67 seconds
>>>
>>>
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>
>>> 3 rows selected (76.835 seconds)
>>>
>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>>> (1, 5, 100000);
>>>
>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>
>>>
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>
>>> 3 rows selected (80.718 seconds)
>>>
>>>
>>>
>>> Three runs returning the same rows in roughly 77 to 83 seconds each.
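
For a quick side-by-side of the two sets of timings, the sketch below averages the elapsed times reported above (the "Time taken" values from spark-sql and the "rows selected" values from beeline). The variable names are made up for illustration. It only summarises these six numbers and says nothing about why they differ.

```python
from statistics import mean

# Elapsed seconds copied from the three spark-sql runs quoted above.
spark_sql_seconds = [50.805, 50.358, 50.563]
# Elapsed seconds copied from the three beeline (Hive on Spark) runs.
hive_on_spark_seconds = [82.66, 76.835, 80.718]

spark_sql_mean = mean(spark_sql_seconds)
hive_on_spark_mean = mean(hive_on_spark_seconds)
# Ratio of the two means: how many times longer Hive on Spark took.
slowdown = hive_on_spark_mean / spark_sql_mean

print(f"Spark SQL mean:     {spark_sql_mean:.2f} s")
print(f"Hive on Spark mean: {hive_on_spark_mean:.2f} s")
print(f"Ratio:              {slowdown:.2f}x")
```

On these numbers Hive on Spark takes roughly 1.6 times as long, but with different Spark versions and unknown resource settings on each side the ratio is not a like-for-like comparison.
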
>>>
>>>
>>>
>>> It is possible that the Spark engine I use with Hive, 1.3.1, is out of
>>> date and that this causes the lag.
>>>
>>>
>>>
>>> There are certain queries that one cannot run with Spark. Besides, it does
>>> not recognize CHAR fields, which is a pain.
>>>
>>>
>>>
>>> spark-sql> *CREATE TEMPORARY TABLE tmp AS*
>>>
>>>          > SELECT t.calendar_month_desc, c.channel_desc,
>>> SUM(s.amount_sold) AS TotalSales
>>>
>>>          > FROM sales s, times t, channels c
>>>
>>>          > WHERE s.time_id = t.time_id
>>>
>>>          > AND   s.channel_id = c.channel_id
>>>
>>>          > GROUP BY t.calendar_month_desc, c.channel_desc
>>>
>>>          > ;
>>>
>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
>>>
>>> .
>>>
>>> You are likely trying to use an unsupported Hive feature.";
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>
>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>
>>>
>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>
>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>>> 15", ISBN 978-0-9563693-0-7*.
>>>
>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>> 978-0-9759693-0-4*
>>>
>>> *Publications due shortly:*
>>>
>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>> 978-0-9563693-3-8
>>>
>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>>> one out shortly
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> NOTE: The information in this email is proprietary and confidential.
>>> This message is for the designated recipient only, if you are not the
>>> intended recipient, you should destroy it immediately. Any information in
>>> this message shall not be understood as given or endorsed by Peridale
>>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>>> stated. It is the responsibility of the recipient to ensure that this email
>>> is virus free, therefore neither Peridale Technology Ltd, its subsidiaries
>>> nor their employees accept any responsibility.
>>>
>>>
>>>
>>> *From:* Xuefu Zhang [mailto:xzhang@cloudera.com]
>>> *Sent:* 02 February 2016 23:12
>>> *To:* user@hive.apache.org
>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>
>>>
>>>
>>> I think the difference is not only about which one does the optimization but
>>> more about feature parity. Hive on Spark offers all the functional features
>>> that Hive offers, and these features run faster. However, Spark SQL is far from
>>> offering this parity as far as I know.
>>>
>>>
>>>
>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <mich@peridale.co.uk>
>>> wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> My understanding is that with Hive on the Spark engine, one gets the Hive
>>> optimizer and the Spark query engine.
>>>
>>> With Spark using the Hive metastore, Spark provides both the optimizer and the
>>> query engine. The only value added is that one can access the underlying Hive
>>> tables from spark-sql etc.
>>>
>>>
>>>
>>>
>>>
>>> Is this assessment correct?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>> --
>>> Sorry this was sent from mobile. Will do less grammar and spell check
>>> than usual.
>>>
>>>
>>>
>>
>
