From: "Mich Talebzadeh" <mich@peridale.co.uk>
To: user@hive.apache.org
Date: Wed, 3 Feb 2016 09:25:15 -0000
Subject: RE: Hive on Spark Engine versus Spark using Hive metastore

Hi Jeff,

I only have a two-node cluster. Is there any way one can simulate additional parallel runs in such an environment, thus having more than two maps?

Thanks,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the book "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7
Co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential.
This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility.

From: Xuefu Zhang [mailto:xzhang@cloudera.com]
Sent: 03 February 2016 02:39
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

Yes, regardless of what Spark mode you're running in, from the Spark AM web UI you should be able to see how many tasks are running concurrently. I'm a little surprised to see that your Hive configuration only allows 2 map tasks to run in parallel. If your cluster has the capacity, you should parallelize all the tasks to achieve optimal performance. Since I don't know your Spark SQL configuration, I cannot tell how much parallelism you have over there. Thus, I'm not sure your comparison is valid.

--Xuefu

On Tue, Feb 2, 2016 at 5:08 PM, Mich Talebzadeh <mich@peridale.co.uk> wrote:

Hi Jeff,

In the below:

"... You should be able to see the resource usage in the YARN resource manager URL."

Just to be clear, are we talking about port 8088/cluster?
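[Editor's note: on a small cluster, the number of concurrent map tasks under Hive on Spark is largely governed by the Spark executor settings, which can be overridden per Hive session. A minimal sketch follows, assuming YARN has the headroom; the values are illustrative examples, not recommendations:]

```sql
-- Illustrative session-level overrides in a Hive session (example values only):
set hive.execution.engine=spark;
set spark.executor.instances=4;   -- request more executors than the observed two
set spark.executor.cores=2;       -- concurrent tasks per executor
set spark.executor.memory=2g;     -- per-executor memory; must fit YARN container limits
```

With 4 executors of 2 cores each, up to 8 tasks could run concurrently, subject to what the YARN ResourceManager (the web UI at port 8088) actually grants.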
Dr Mich Talebzadeh

From: Koert Kuipers [mailto:koert@tresata.com]
Sent: 03 February 2016 00:09
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

Uuuhm, with Spark using the Hive metastore you actually have a real programming environment and you can write real functions, versus just being boxed into some version of SQL and limited UDFs?

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzhang@cloudera.com> wrote:

When comparing the performance, you need to compare apples to apples. In another thread, you mentioned that Hive on Spark is much slower than Spark SQL. However, you configured Hive such that only two tasks can run in parallel.
However, you didn't provide information on how much Spark SQL is utilizing. Thus, it's hard to tell whether it's just a configuration problem in your Hive, or Spark SQL is indeed faster. You should be able to see the resource usage in the YARN resource manager URL.

--Xuefu

On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <mich@peridale.co.uk> wrote:

Thanks Jeff.

Obviously Hive is much more feature-rich compared to Spark. Having said that, in certain areas, for example where the SQL feature is available in Spark, Spark seems to deliver faster.

This may be because:

1. Spark does both the optimisation and the execution seamlessly.
2. Hive on Spark has to invoke YARN, which adds another layer to the process.

Now I did some simple tests on a 100-million-row ORC table available through Hive to both.

Spark 1.5.2 on Hive 1.2.1 metastore:

spark-sql> select * from dummy where id in (1, 5, 100000);
1       0    0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi       1  xxxxxxxxxx
5       0    4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA       5  xxxxxxxxxx
100000  99   999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
Time taken: 50.805 seconds, Fetched 3 row(s)
spark-sql> select * from dummy where id in (1, 5, 100000);
1       0    0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi       1  xxxxxxxxxx
5       0    4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA       5  xxxxxxxxxx
100000  99   999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
Time taken: 50.358 seconds, Fetched 3 row(s)
spark-sql> select * from dummy where id in (1, 5, 100000);
1       0    0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi       1  xxxxxxxxxx
5       0    4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA       5  xxxxxxxxxx
100000  99   999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
Time taken: 50.563 seconds, Fetched 3 row(s)

So three runs, each returning three rows in just over 50 seconds.

Hive 1.2.1 on Spark 1.3.1 execution engine:
0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
INFO  : Query Hive on Spark job[4] stages:
INFO  : 4
INFO  : Status: Running (Hive on Spark job[4])
INFO  : Status: Finished successfully in 82.49 seconds
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
| 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
| 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
| 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
3 rows selected (82.66 seconds)
0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
INFO  : Status: Finished successfully in 76.67 seconds
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
| 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
| 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
| 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
3 rows selected (76.835 seconds)
0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
INFO  : Status: Finished successfully in 80.54 seconds
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
| 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
| 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
| 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
3 rows selected (80.718 seconds)

Three runs returning the same rows in about 80 seconds.

It is possible that my Spark engine with Hive is 1.3.1, which is out of date, and that this causes the lag.

There are certain queries that one cannot do with Spark. Besides, it does not recognize CHAR fields, which is a pain.
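[Editor's note: averaging the timings quoted above gives roughly 50.6 seconds for Spark SQL ((50.805 + 50.358 + 50.563) / 3) against roughly 80.1 seconds for Hive on Spark ((82.66 + 76.835 + 80.718) / 3), i.e. about a 1.6x gap on this query. The arithmetic can be checked with a throwaway query, assuming your SQL dialect allows SELECT without a FROM clause:]

```sql
-- Sanity-check arithmetic on the quoted run times (not itself a benchmark):
SELECT (50.805 + 50.358 + 50.563) / 3 AS spark_sql_avg_secs,
       (82.660 + 76.835 + 80.718) / 3 AS hive_on_spark_avg_secs;
```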
spark-sql> CREATE TEMPORARY TABLE tmp AS
         > SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
         > FROM sales s, times t, channels c
         > WHERE s.time_id = t.time_id
         > AND   s.channel_id = c.channel_id
         > GROUP BY t.calendar_month_desc, c.channel_desc
         > ;
Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7.
You are likely trying to use an unsupported Hive feature.";

Dr Mich Talebzadeh
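[Editor's note: one possible workaround for the unsupported TEMPORARY clause above, assuming write access to a scratch database, is to materialise the result as an ordinary table and drop it when finished. The table name tmp_sales is made up for illustration:]

```sql
-- Sketch only: plain CTAS instead of CREATE TEMPORARY TABLE (tmp_sales is hypothetical)
CREATE TABLE tmp_sales AS
SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
FROM sales s, times t, channels c
WHERE s.time_id = t.time_id
  AND s.channel_id = c.channel_id
GROUP BY t.calendar_month_desc, c.channel_desc;

-- ... queries against tmp_sales ...

DROP TABLE tmp_sales;
```

Alternatively, from the Spark programmatic API the DataFrame for the query can be registered as an in-memory temporary table for the session, which avoids writing to the warehouse at all.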
From: Xuefu Zhang [mailto:xzhang@cloudera.com]
Sent: 02 February 2016 23:12
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

I think the difference is not only about which does the optimization, but more about feature parity. Hive on Spark offers all the functional features that Hive offers, and these features play out faster. However, Spark SQL is far from offering this parity, as far as I know.

On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <mich@peridale.co.uk> wrote:

Hi,

My understanding is that with the Hive on Spark engine, one gets the Hive optimizer and the Spark query engine.

With Spark using the Hive metastore, Spark does both the optimization and the query execution. The only value-add is that one can access the underlying Hive tables from spark-sql etc.

Is this assessment correct?

Thanks,

Dr Mich Talebzadeh