Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 6AA22185C8 for ; Thu, 31 Dec 2015 18:54:42 +0000 (UTC) Received: (qmail 80001 invoked by uid 500); 31 Dec 2015 18:54:41 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 79929 invoked by uid 500); 31 Dec 2015 18:54:40 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 79919 invoked by uid 99); 31 Dec 2015 18:54:40 -0000 Received: from Unknown (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 31 Dec 2015 18:54:40 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 69CC7C0CFB for ; Thu, 31 Dec 2015 18:54:40 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 4.101 X-Spam-Level: **** X-Spam-Status: No, score=4.101 tagged_above=-999 required=6.31 tests=[HTML_MESSAGE=3, KAM_COUK=1.1, URIBL_BLOCKED=0.001] autolearn=disabled Received: from mx1-us-east.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id elsPJGQitsZF for ; Thu, 31 Dec 2015 18:54:28 +0000 (UTC) Received: from sulu.netzoomi.net (sulu.netzoomi.net [83.138.144.103]) by mx1-us-east.apache.org (ASF Mail Server at mx1-us-east.apache.org) with ESMTP id B9F7542AC6 for ; Thu, 31 Dec 2015 18:54:27 +0000 (UTC) Received: from vulcan.netzoomi.net (unknown [212.100.249.54]) by sulu.netzoomi.net (Postfix) with ESMTP id 8A43F6A44FC for ; Thu, 31 Dec 2015 18:54:26 +0000 (GMT) X-Envelope-From: Received: from w7 (cpc86449-seve24-2-0-cust177.13-3.cable.virginm.net [86.19.59.178]) by vulcan.netzoomi.net (Postfix) with ESMTPA id 4C2961248292 for ; Thu, 31 Dec 2015 18:54:26 +0000 (GMT) From: "Mich Talebzadeh" To: References: <015001d14227$2e852310$8b8f6930$@peridale.co.uk> <3EDBB066-514E-4C4C-90D1-E1802B05A7FB@gmail.com> <01aa01d14300$4f4afaf0$ede0f0d0$@peridale.co.uk> <2F5F6F5F-FA73-4D85-82BD-3A60281EBD9D@gmail.com> <01e101d14337$071e65f0$155b31d0$@peridale.co.uk> <025d01d143f9$da9683f0$8fc38bd0$@peridale.co.uk> <6990B863-FFCF-4D72-BE98-D7A525337E25@gmail.com> In-Reply-To: <6990B863-FFCF-4D72-BE98-D7A525337E25@gmail.com> Subject: RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database Date: Thu, 31 Dec 2015 18:54:33 -0000 Message-ID: <026601d143fc$b00c22e0$102468a0$@peridale.co.uk> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_NextPart_000_0267_01D143FC.B00E6CD0" X-Mailer: Microsoft Outlook 16.0 Thread-Index: AQKc9qhsJZKxINn8+BYOGrjBQgOVvQJjrsKYAqtTgjUBS+vqHgGpWwNNAvaXKdwBw0FGAAGfzXmmnNsantA= Content-Language: en-gb This is a multipart message in MIME format. ------=_NextPart_000_0267_01D143FC.B00E6CD0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable I agree but Spark 1.3.1 on Hive is the only one I have managed to make = it work. Still it is twice as fast as Hive on MapReduce. =20 Just to clarify my understanding is that the optimiser is provided by = Hive and is the same for both executions engines. Is there anything = specific that Spark 1.3.1 lacks compared to Spark 1.5.1 when executing = the query? =20 Thanks =20 =20 Mich Talebzadeh =20 Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-0919= 08.pdf Author of the books "A Practitioner=E2=80=99s Guide to Upgrading to = Sybase ASE 15", ISBN 978-0-9563693-0-7.=20 co-author "Sybase Transact SQL Guidelines Best Practices", ISBN = 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: = 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, = volume one out shortly =20 http://talebzadehmich.wordpress.com = =20 =20 NOTE: The information in this email is proprietary and confidential. = This message is for the designated recipient only, if you are not the = intended recipient, you should destroy it immediately. Any information = in this message shall not be understood as given or endorsed by Peridale = Technology Ltd, its subsidiaries or their employees, unless expressly so = stated. It is the responsibility of the recipient to ensure that this = email is virus free, therefore neither Peridale Ltd, its subsidiaries = nor their employees accept any responsibility. =20 From: J=C3=B6rn Franke [mailto:jornfranke@gmail.com]=20 Sent: 31 December 2015 18:44 To: user@hive.apache.org Subject: Re: Running the same query on 1 billion rows fact table in Hive = on Spark compared to Sybase IQ columnar database =20 You are using an old version of Spark and it cannot leverage all = optimizations of Hive, so I think that your conclusion cannot be as easy = as you might think.=20 On 31 Dec 2015, at 19:34, Mich Talebzadeh > wrote: Ok guys. =20 I have not succeeded in installing TEZ. Yet so I can try the query on = TEZ as well. =20 Just to remind that the query is used is pretty common. Get the total = amount sold for each calendar month from sales (I billion rows) and = times=20 =20 SELECT t.calendar_month_desc, SUM(s.amount_sold) FROM sales s, times t WHERE s.time_id =3D t.time_id GROUP BY t.calendar_month_desc; =20 In total 48 rows are returned back Now having thought about It, granted TEZ is going to be faster than MR = as it is basically MR with DAG thrown at it. On the other Spark will = have both DAG and in-memory calculation.=20 =20 =20 The results are as follow: =20 =20 Optimiser Engine Timing = Compression Total Table size =20 Hive MapReduce 4673.035 seconds Snappy = totalSize=3D2678882153 =3D 2.5GB Hive Spark 1.3.1 1578.817 seconds Snappy Columnar Sybase IQ 30.000 seconds Native = 5GB =20 =20 It is pretty obvious that Spark outperforms MapReduce more than twice = even taking into account the number of rows on the FACT table and = frankly I would not have thought that TEZ is going to beat Spark (to be = seen). Having said that Hive storage is twice more efficient but I am = not sure what one can do to improve the performance. Table in Hive is = stored as ORC table and it has crossed my mind that maybe we should = think about storing every column of an ORC table as an index. That may = improve the performance further. =20 HTH =20 =20 Mich Talebzadeh =20 Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-0919= 08.pdf Author of the books "A Practitioner=E2=80=99s Guide to Upgrading to = Sybase ASE 15", ISBN 978-0-9563693-0-7.=20 co-author "Sybase Transact SQL Guidelines Best Practices", ISBN = 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: = 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, = volume one out shortly =20 http://talebzadehmich.wordpress.com = =20 =20 NOTE: The information in this email is proprietary and confidential. = This message is for the designated recipient only, if you are not the = intended recipient, you should destroy it immediately. Any information = in this message shall not be understood as given or endorsed by Peridale = Technology Ltd, its subsidiaries or their employees, unless expressly so = stated. It is the responsibility of the recipient to ensure that this = email is virus free, therefore neither Peridale Ltd, its subsidiaries = nor their employees accept any responsibility. =20 From: Marcin Tustin [mailto:mtustin@handybook.com]=20 Sent: 30 December 2015 19:27 To: user@hive.apache.org =20 Subject: Re: Running the same query on 1 billion rows fact table in Hive = on Spark compared to Sybase IQ columnar database =20 I'm using TEZ 0.7.0.2.3 with hive 1.2.1.2.3. I can confirm that TEZ is = much faster than MR in pretty much all cases. Also, with hive, you'll = make sure you've performed optimizations like aligning ORC stripe sizes = with HDFS block sizes, and concatenated your tables (not so much an = optimization as a must for avoiding the small files problem). =20 On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh > wrote: Thanks again Jorn. =20 =20 Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ = I have compression enabled. The FACT table in IQ (sales) has LF (read = bitmap) indexes on the time_id column. For the dimension table (times) I = have time_id defined as primary key. Also Sybase IQ creates FP (fast = projection) indexes on every column by default. =20 Anyway I am trying to download and build TEZ. Do we know which version = of TEZ works with Hive 1.2.1 please? 0.8 seems to be in alpha =20 Thanks =20 Mich Talebzadeh =20 Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-0919= 08.pdf Author of the books "A Practitioner=E2=80=99s Guide to Upgrading to = Sybase ASE 15", ISBN 978-0-9563693-0-7.=20 co-author "Sybase Transact SQL Guidelines Best Practices", ISBN = 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: = 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, = volume one out shortly =20 http://talebzadehmich.wordpress.com = =20 =20 NOTE: The information in this email is proprietary and confidential. = This message is for the designated recipient only, if you are not the = intended recipient, you should destroy it immediately. Any information = in this message shall not be understood as given or endorsed by Peridale = Technology Ltd, its subsidiaries or their employees, unless expressly so = stated. It is the responsibility of the recipient to ensure that this = email is virus free, therefore neither Peridale Ltd, its subsidiaries = nor their employees accept any responsibility. =20 ------=_NextPart_000_0267_01D143FC.B00E6CD0 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable

I agree but Spark 1.3.1 on Hive is the only one I have = managed to make it work. Still it is twice as fast as Hive on = MapReduce.

 

Just to clarify my understanding is that the optimiser is = provided by Hive and is the same for both executions engines. Is there = anything specific that Spark 1.3.1 lacks compared to Spark 1.5.1 when = executing the query?

 

Thanks =C2=A0

 

Mich = Talebzadeh

 

Sybase ASE 15 = Gold Medal Award 2008

A= Winning Strategy: Running the most Critical Financial Data on ASE = 15

http://login.sybase.com/files/Product_Overviews/ASE-Winni= ng-Strategy-091908.pdf

Auth= or of the books "A Practitioner=E2=80=99s Guide to Upgrading to = Sybase ASE 15", ISBN 978-0-9563693-0-7. =

co-a= uthor "Sybase Transact SQL Guidelines Best Practices", ISBN = 978-0-9759693-0-4

Publications due shortly:

Com= plex Event Processing in Heterogeneous Environments, = ISBN: 978-0-9563693-3-8

Oracle and = Sybase, Concepts and Contrasts, ISBN: = 978-0-9563693-1-4, volume one out = shortly

 

http://talebzadehmich.wordp= ress.com

 

NOTE= : The information in this email is proprietary and confidential. This = message is for the designated recipient only, if you are not the = intended recipient, you should destroy it immediately. Any information = in this message shall not be understood as given or endorsed by Peridale = Technology Ltd, its subsidiaries or their employees, unless expressly so = stated. It is the responsibility of the recipient to ensure that this = email is virus free, therefore neither Peridale Ltd, its subsidiaries = nor their employees accept any responsibility.<= o:p>

 

From:<= /b> J=C3=B6rn = Franke [mailto:jornfranke@gmail.com]
Sent: 31 December 2015 = 18:44
To: user@hive.apache.org
Subject: Re: Running = the same query on 1 billion rows fact table in Hive on Spark compared to = Sybase IQ columnar database

 

You are = using an old version of Spark and it cannot leverage all optimizations = of Hive, so I think that your conclusion cannot be as easy as you might = think. 


On 31 Dec 2015, at 19:34, Mich = Talebzadeh <mich@peridale.co.uk> = wrote:

Ok guys.

 

I have not succeeded in installing TEZ. Yet so I can try the = query on TEZ as well.

 

Just to remind that the query is used is pretty common. Get = the total amount sold for each calendar month from sales (I billion = rows) and times

 

SELECT t.calendar_month_desc, = SUM(s.amount_sold)

FROM sales s, times t WHERE = s.time_id =3D t.time_id

GROUP BY = t.calendar_month_desc;

 

In total 48 rows are returned back

Now having thought about It, granted TEZ is going to be = faster than MR as it is basically MR with DAG thrown at it. On the other = Spark will have both DAG and in-memory calculation. =

 

 

The results are as follow:

 

 

Optimiser   &= nbsp;         = Engine           &= nbsp;   = Timing           &= nbsp;   = Compression           = Total Table size      =

Hive    &= nbsp;            = MapReduce          &nbs= p;  4673.035 seconds      = Snappy     =            = totalSize=3D2678882153 =3D 2.5GB

Hive    &= nbsp;            = Spark 1.3.1           = 1578.817 seconds      Snappy

Columnar   &nb= sp;          Sybase = IQ            = ;  30.000 seconds      = Native     =            = 5GB

 

<= p class=3DMsoNormal> 

It is pretty obvious that Spark outperforms MapReduce more = than twice even taking into account the number of rows on the FACT table = and frankly I would not have thought that TEZ is going to beat Spark (to = be seen). Having said that Hive storage is twice more efficient but I am = not sure what one can do to improve the performance. Table in Hive is = stored as ORC table and it has crossed my mind that maybe we should = think about storing every column of an ORC table as an index. That may = improve the performance further.

 

HTH

 

 

Mich = Talebzadeh

 

Sybase ASE 15 = Gold Medal Award 2008

A= Winning Strategy: Running the most Critical Financial Data on ASE = 15

http://login.sybase.com/files/Product_Overviews/ASE-Winni= ng-Strategy-091908.pdf

Auth= or of the books "A Practitioner=E2=80=99s Guide to Upgrading to = Sybase ASE 15", ISBN 978-0-9563693-0-7. =

co-a= uthor "Sybase Transact SQL Guidelines Best Practices", ISBN = 978-0-9759693-0-4

Publications due shortly:

Com= plex Event Processing in Heterogeneous Environments, = ISBN: 978-0-9563693-3-8

Oracle and = Sybase, Concepts and Contrasts, ISBN: = 978-0-9563693-1-4, volume one out = shortly

&nb= sp;

http://talebzadehmich.wordp= ress.com

 

NOTE= : The information in this email is proprietary and confidential. This = message is for the designated recipient only, if you are not the = intended recipient, you should destroy it immediately. Any information = in this message shall not be understood as given or endorsed by Peridale = Technology Ltd, its subsidiaries or their employees, unless expressly so = stated. It is the responsibility of the recipient to ensure that this = email is virus free, therefore neither Peridale Ltd, its subsidiaries = nor their employees accept any responsibility.

 

From:<= /b> Marcin = Tustin [mailto:mtustin@handybook.com] =
Sent: 30 December 2015 19:27
To: user@hive.apache.org
Subje= ct: Re: Running the same query on 1 billion rows fact table in Hive = on Spark compared to Sybase IQ columnar database

 

I'm = using TEZ 0.7.0.2.3 with hive 1.2.1.2.3. I can confirm that = TEZ is much faster than MR in pretty much all cases. Also, with hive, = you'll make sure you've performed optimizations like aligning ORC stripe = sizes with HDFS block sizes, and concatenated your tables (not so much = an optimization as a must for avoiding the small files = problem).

 

On Wed, = Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <mich@peridale.co.uk> = wrote:

Thanks again = Jorn.

 

 

Both Hive and Sybase IQ are = running on the same host. Yes for Sybase IQ I have compression enabled. = The FACT table in IQ (sales) has LF (read bitmap) indexes on the time_id = column. For the dimension table (times) I have time_id defined as = primary key. Also Sybase IQ creates FP (fast projection) indexes on = every column by default.

 

Anyway I am trying to download = and build TEZ. Do we know which version of TEZ works with Hive 1.2.1 = please? 0.8 seems to be in alpha

 

Thanks

 

Mich = Talebzadeh

 

Sybase ASE 15 = Gold Medal Award 2008

A= Winning Strategy: Running the most Critical Financial Data on ASE = 15

http://login.sybase.com/files/Product_Overviews/ASE-Win= ning-Strategy-091908.pdf

Auth= or of the books "A Practitioner=E2=80=99s Guide to Upgrading to = Sybase ASE 15", ISBN 978-0-9563693-0-7. =

co-a= uthor "Sybase Transact SQL Guidelines Best Practices", ISBN = 978-0-9759693-0-4

Publications due = shortly:

Com= plex Event Processing in Heterogeneous Environments, = ISBN: 978-0-9563693-3-8

Oracle and = Sybase, Concepts and Contrasts, ISBN: = 978-0-9563693-1-4, volume one out = shortly

&nb= sp;

http://talebzadehmich.wordpress.com

 

NOTE= : The information in this email is proprietary and confidential. This = message is for the designated recipient only, if you are not the = intended recipient, you should destroy it immediately. Any information = in this message shall not be understood as given or endorsed by Peridale = Technology Ltd, its subsidiaries or their employees, unless expressly so = stated. It is the responsibility of the recipient to ensure that this = email is virus free, therefore neither Peridale Ltd, its subsidiaries = nor their employees accept any = responsibility.

 

------=_NextPart_000_0267_01D143FC.B00E6CD0--