hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mich Talebzadeh" <m...@peridale.co.uk>
Subject RE: Join vs. Where...In
Date Sat, 29 Aug 2015 19:43:39 GMT
In RDBMS (say Oracle or Sybase) a hash join comes to play when you tend to join two tables
with a large set of information (large being a relative term). The smaller of two tables is
hashed into memory.  

 

For Hive, I have two tables one called  t with 1.7 million rows and another called smallt
with 100 rows. Using three ways of joining the two tables on object_id column

 

1)    WHERE EXISTS

 

0: jdbc:hive2://rhes564:10010/default> select count(1) from t WHERE EXISTS (select 1 from
smallt where t.object_id = smallt.object_id);

+------+--+

| _c0  |

+------+--+

| 100  |

+------+--+

1 row selected (66.369 seconds)

 

2)    IN

 

0: jdbc:hive2://rhes564:10010/default> select count(1) from t where t.object_id IN (select
smallt.object_id from smallt);

+------+--+

| _c0  |

+------+--+

| 100  |

+------+--+

1 row selected (66.158 seconds)

 

3)    Classic Join

 

0: jdbc:hive2://rhes564:10010/default> select count(1) from t, smallt where t.object_id
= smallt.object_id;

+------+--+

| _c0  |

+------+--+

| 100  |

+------+--+

1 row selected (68.978 seconds)

 

 

You can see the results and judge for yourself

 

HTH

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 <http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.


co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

 <http://talebzadehmich.wordpress.com/> http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This message is for the
designated recipient only, if you are not the intended recipient, you should destroy it immediately.
Any information in this message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility
of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd,
its subsidiaries nor their employees accept any responsibility.

 

From: Raviv Murciano-Goroff [mailto:ravivmg@gmail.com] 
Sent: 29 August 2015 00:50
To: user@hive.apache.org
Subject: Join vs. Where...In

 

Hi,

 

I often have the following situation: I have a small table with a list of unique IDs and a
very large table of events associated with the IDs. I want to perform some aggregation including
only events associated with IDs from the small table.

 

Is there a rule of thumb for whether performing a JOIN on the unique ID is faster or slower
than using WHERE id IN (SELECT id FROM small_table...)?

 

Thank you for your advice,

Raviv


Mime
View raw message