hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From <mdefoinplatel....@orange.com>
Subject RE: LEFT OUTER JOIN and partition
Date Tue, 03 Jan 2012 07:52:56 GMT
Hi Mark,
First of all, Happy New Year !
And sorry for the late answer...

I had the hive.optimize.ppd option set to true already so it explains why the query works
fast :
SELECT count(A.id) FROM a LEFT OUTER JOIN b ON (A.id=B.id AND B.dt=X) WHERE A.dt=X AND B.id
IS NULL;


I am relatively happy with this query now but nevertheless I feel like this situation is not
very intuitive and should be documented somewhere.

Cheers,
Michael

-----Message d'origine-----
De : Mark Grover [mailto:mgrover@oanda.com] 
Envoyé : jeudi 22 décembre 2011 19:36
À : user@hive.apache.org
Objet : Re: LEFT OUTER JOIN and partition

Hi Michael,
Thank you for the detailed explanation.

I wouldn't have expected 
SELECT count(A.id) FROM a LEFT OUTER JOIN b ON (A.id=B.id AND A.dt=X AND B.dt=X) WHERE B.id
IS NULL; 
to give
|A(dt=*)-B(dt=X)| 


Can you try the following?:
set hive.optimize.ppd=true;
SELECT count(A.id) FROM a LEFT OUTER JOIN b ON (A.id=B.id AND B.dt=X) WHERE A.dt=X and B.id
IS NULL; 

A couple things about the above queries:
1) Depending on what version of Hive you have, you might have Predicate Pushdown(ppd) already
enabled by default but it doesn't hurt to enable it explicitly:-)
2) In left joins, the optimization only happens on the left table (in your case, A) and will
pull the A.dt=X "up" to filter out the rows before the join takes place. 

Let me know what you get.

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 
e: mgrover@oanda.com 

"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009. 


----- Original Message -----
From: "mdefoinplatel ext" <mdefoinplatel.ext@orange.com>
To: user@hive.apache.org
Sent: Thursday, December 22, 2011 11:37:30 AM
Subject: RE: LEFT OUTER JOIN and partition




I have been looking at this problem in more details… 



So let say you have to tables A and B containing distinct Ids and partitioned by date. Let
say for each partition, B is a subset of A (all the rows in B are present in A) 

If you want to compute the size of set difference A – B for a given partition dt=X, it is
recommend (on this list) to do : 



SELECT count(A.id) FROM a LEFT OUTER JOIN b ON (A.id=B.id AND A.dt=X AND B.dt=X) WHERE B.id
IS NULL; 



But actually this doesn’t work as expected (at least by me ;) ) and the query above does
not give you the expected count |A(dt=X)-B(dt=X)| but this one |A(dt=*)-B(dt=X)| 



I made some further tests : 



This query works (it gives the expected count) 

SELECT count(A.id) FROM a LEFT OUTER JOIN b ON (A.id=B.id AND A.dt=B.dt) WHERE A.dt=X AND
B.id IS NULL; 

but it is not optimal since it performs first a join on all the partitions and only in the
WHERE close it filters out the partition you are interested in. 



This query works for me and is much faster: 

SELECT count(A.id) FROM a LEFT OUTER JOIN b ON (A.id=B.id AND B.dt=X) WHERE A.dt=X AND B.id
IS NULL; 



Everything works as if the restriction in the JOIN only apply to the RIGHT table… 



I don’t know if this is a bug or if my original understanding of the LEFT OUTER JOIN was
wrong but I am pretty sure I am not the only one falling into that trap ! 



Cheers, 

Michael 









De : mdefoinplatel.ext@orange.com [mailto:mdefoinplatel.ext@orange.com] 
Envoyé : jeudi 22 décembre 2011 14:29 
À : hive-user@hadoop.apache.org 
Objet : LEFT OUTER JOIN and partition 



Hi folks, 

Ok, I give you the context first : 



I have seen in the mailing list people advising the use of a … LEFT OUTER JOIN ON … followed
by a WHERE close to write a NOT IN type of query. 



So to make it clear, it should be possible to rewrite the following query: 

SELECT a.id FROM a WHERE a.id NOT IN (SELECT b.id FROM b); 



Into the following one: 

SELECT a.id FROM a LEFT OUTER JOIN b ON (a.id=b.id) WHERE b.id IS NULL; 



Ok fine, but what is going on in case of partitioned tables ? 



Suppose a and b are partitioned on dt and you want to write the query only for a given partition:


SELECT a.id FROM a LEFT OUTER JOIN b ON (a.id=b.id AND a.dt=X AND b.dt=X) WHERE b.id IS NULL;




As far as I am concerned, the restriction a.dt=X does not produce the expected results and
I will get rows with a.dt different from X … 

So of course I can restrict the output to the partitions I am interested in with a WHERE clause
but this looks very odd to me. 



So my questions are : 

1) Am I the only one experience this behavior ? 

2) Since the JOIN operation is performed before the WHERE one then does it really means that
the join will take place on all the partitions ( in which case this type of query will not
be tractable in practice) ? 

3) Is there any other way to write a NOT IN query in Hive and more generally how to perform
a set difference operation in Hive ? 



Sorry for the long email… 



Cheers, 

Michael 



_________________________________________________________________________________________________________________________
Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees
et ne doivent donc pas etre diffuses, exploites ou copies sans autorisation. Si vous avez
recu ce message par erreur, veuillez le signaler a l'expediteur et le detruire ainsi que les
pieces jointes. Les messages electroniques etant susceptibles d'alteration, France Telecom
- Orange decline toute responsabilite si ce message a ete altere, deforme ou falsifie. Merci
This message and its attachments may contain confidential or privileged information that may
be protected by law; they should not be distributed, used or copied without authorization.
If you have received this email in error, please notify the sender and delete this message
and its attachments. As emails may be altered, France Telecom - Orange shall not be liable
if this message was modified, changed or falsified. Thank you. _________________________________________________________________________________________________________________________

Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees
et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce message par
erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages electroniques etant
susceptibles d'alteration,
France Telecom - Orange decline toute responsabilite si ce message a ete altere, deforme ou
falsifie. Merci

This message and its attachments may contain confidential or privileged information that may
be protected by law;
they should not be distributed, used or copied without authorization.
If you have received this email in error, please notify the sender and delete this message
and its attachments.
As emails may be altered, France Telecom - Orange shall not be liable if this message was
modified, changed or falsified.
Thank you. 

_________________________________________________________________________________________________________________________

Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees
et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce message par
erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages electroniques etant
susceptibles d'alteration,
France Telecom - Orange decline toute responsabilite si ce message a ete altere, deforme ou
falsifie. Merci

This message and its attachments may contain confidential or privileged information that may
be protected by law;
they should not be distributed, used or copied without authorization.
If you have received this email in error, please notify the sender and delete this message
and its attachments.
As emails may be altered, France Telecom - Orange shall not be liable if this message was
modified, changed or falsified.
Thank you.

Mime
View raw message