hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mich Talebzadeh" <m...@peridale.co.uk>
Subject RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Date Thu, 31 Dec 2015 18:54:33 GMT
I agree but Spark 1.3.1 on Hive is the only one I have managed to make it work. Still it is
twice as fast as Hive on MapReduce.

 

Just to clarify my understanding is that the optimiser is provided by Hive and is the same
for both executions engines. Is there anything specific that Spark 1.3.1 lacks compared to
Spark 1.5.1 when executing the query?

 

Thanks  

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.


co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This message is for the
designated recipient only, if you are not the intended recipient, you should destroy it immediately.
Any information in this message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility
of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd,
its subsidiaries nor their employees accept any responsibility.

 

From: Jörn Franke [mailto:jornfranke@gmail.com] 
Sent: 31 December 2015 18:44
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared
to Sybase IQ columnar database

 

You are using an old version of Spark and it cannot leverage all optimizations of Hive, so
I think that your conclusion cannot be as easy as you might think. 


On 31 Dec 2015, at 19:34, Mich Talebzadeh <mich@peridale.co.uk <mailto:mich@peridale.co.uk>
> wrote:

Ok guys.

 

I have not succeeded in installing TEZ. Yet so I can try the query on TEZ as well.

 

Just to remind that the query is used is pretty common. Get the total amount sold for each
calendar month from sales (I billion rows) and times 

 

SELECT t.calendar_month_desc, SUM(s.amount_sold)

FROM sales s, times t WHERE s.time_id = t.time_id

GROUP BY t.calendar_month_desc;

 

In total 48 rows are returned back

Now having thought about It, granted TEZ is going to be faster than MR as it is basically
MR with DAG thrown at it. On the other Spark will have both DAG and in-memory calculation.


 

 

The results are as follow:

 

 

Optimiser             Engine               Timing               Compression           Total
Table size      

Hive                 MapReduce             4673.035 seconds      Snappy                totalSize=2678882153
= 2.5GB

Hive                 Spark 1.3.1           1578.817 seconds      Snappy

Columnar              Sybase IQ              30.000 seconds      Native                5GB

 

 

It is pretty obvious that Spark outperforms MapReduce more than twice even taking into account
the number of rows on the FACT table and frankly I would not have thought that TEZ is going
to beat Spark (to be seen). Having said that Hive storage is twice more efficient but I am
not sure what one can do to improve the performance. Table in Hive is stored as ORC table
and it has crossed my mind that maybe we should think about storing every column of an ORC
table as an index. That may improve the performance further.

 

HTH

 

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.


co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This message is for the
designated recipient only, if you are not the intended recipient, you should destroy it immediately.
Any information in this message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility
of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd,
its subsidiaries nor their employees accept any responsibility.

 

From: Marcin Tustin [mailto:mtustin@handybook.com] 
Sent: 30 December 2015 19:27
To: user@hive.apache.org <mailto:user@hive.apache.org> 
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared
to Sybase IQ columnar database

 

I'm using TEZ 0.7.0.2.3 with hive 1.2.1.2.3. I can confirm that TEZ is much faster than MR
in pretty much all cases. Also, with hive, you'll make sure you've performed optimizations
like aligning ORC stripe sizes with HDFS block sizes, and concatenated your tables (not so
much an optimization as a must for avoiding the small files problem).

 

On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <mich@peridale.co.uk <mailto:mich@peridale.co.uk>
> wrote:

Thanks again Jorn.

 

 

Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ I have compression
enabled. The FACT table in IQ (sales) has LF (read bitmap) indexes on the time_id column.
For the dimension table (times) I have time_id defined as primary key. Also Sybase IQ creates
FP (fast projection) indexes on every column by default.

 

Anyway I am trying to download and build TEZ. Do we know which version of TEZ works with Hive
1.2.1 please? 0.8 seems to be in alpha

 

Thanks

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.


co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This message is for the
designated recipient only, if you are not the intended recipient, you should destroy it immediately.
Any information in this message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility
of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd,
its subsidiaries nor their employees accept any responsibility.

 


Mime
View raw message