hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/LanguageManual/Joins" by AdamKramer
Date Fri, 13 Aug 2010 21:22:19 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/LanguageManual/Joins" page has been changed by AdamKramer.
http://wiki.apache.org/hadoop/Hive/LanguageManual/Joins?action=diff&rev1=20&rev2=21

--------------------------------------------------

  <<TableOfContents>>
  
+ == Join Syntax ==
- ## page was renamed from Hive/LanguageManual/LanguageManual/Joins
- == THIS PAGE WAS MOVED TO HIVE XDOCS ! DO NOT EDIT!Join Syntax ==
  Hive supports the following syntax for joining tables:
  
  {{{
@@ -113, +112 @@

    JOIN b ON (a.key = b.key)
    LEFT OUTER JOIN c ON (a.key = c.key)
  }}}
-  . ..first joins a on b, throwing away everything in a or b that does not have a corresponding
key in the other table. The reduced table is then joined on c. This provides unintuitive results
if there is a key that exists in both a and c, but not b: The whole row (including a.val1,
a.val2, and a.key) is dropped in the "a JOIN b" step, so when the result of that is joined
with c, any row with a c.key that had a corresponding a.key or b.key (but not both) will show
up as NULL, NULL, NULL, c.val.
+  ...first joins a on b, throwing away everything in a or b that does not have a corresponding
key in the other table. The reduced table is then joined on c. This provides unintuitive results
if there is a key that exists in both a and c but not b: The whole row (including a.val1,
a.val2, and a.key) is dropped in the "a JOIN b" step because it is not in b. The result does
not have a.key in it, so when it is LEFT OUTER JOINed with c, c.val does not make it in because
there is no c.key that matches an a.key (because that row from a was removed). Similarly,
if this were a RIGHT OUTER JOIN (instead of LEFT), we would end up with an even weirder effect:
NULL, NULL, NULL, c.val, because even though we specified a.key=c.key as the join key, we
dropped all rows of a that did not match the first JOIN.
+ 
+ To achieve the more intuitive effect, we should instead do FROM c LEFT OUTER JOIN a ON (c.key
= a.key) LEFT OUTER JOIN b ON (c.key = b.key).
  
   * LEFT SEMI JOIN implements the correlated IN/EXISTS subquery semantics in an efficient
way. Since Hive currently does not support IN/EXISTS subqueries, you can rewrite your queries
using LEFT SEMI JOIN. The restrictions of using LEFT SEMI JOIN is that the right-hand-side
table should only be referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses
etc.
  

Mime
View raw message