Return-Path: Delivered-To: apmail-hadoop-common-commits-archive@www.apache.org Received: (qmail 34864 invoked from network); 24 Jul 2009 00:02:17 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 24 Jul 2009 00:02:17 -0000 Received: (qmail 8856 invoked by uid 500); 24 Jul 2009 00:03:23 -0000 Delivered-To: apmail-hadoop-common-commits-archive@hadoop.apache.org Received: (qmail 8773 invoked by uid 500); 24 Jul 2009 00:03:22 -0000 Mailing-List: contact common-commits-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: common-dev@hadoop.apache.org Delivered-To: mailing list common-commits@hadoop.apache.org Received: (qmail 8764 invoked by uid 500); 24 Jul 2009 00:03:22 -0000 Delivered-To: apmail-hadoop-core-commits@hadoop.apache.org Received: (qmail 8761 invoked by uid 99); 24 Jul 2009 00:03:22 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 24 Jul 2009 00:03:22 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.130] (HELO eos.apache.org) (140.211.11.130) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 24 Jul 2009 00:03:19 +0000 Received: from eos.apache.org (localhost [127.0.0.1]) by eos.apache.org (Postfix) with ESMTP id 96128118C1 for ; Fri, 24 Jul 2009 00:02:58 +0000 (GMT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit From: Apache Wiki To: core-commits@hadoop.apache.org Date: Fri, 24 Jul 2009 00:02:58 -0000 Message-ID: <20090724000258.16100.22953@eos.apache.org> Subject: [Hadoop Wiki] Update of "Hive/LanguageManual/Joins" by AdamKramer X-Virus-Checked: Checked by ClamAV on apache.org Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification. The following page has been changed by AdamKramer: http://wiki.apache.org/hadoop/Hive/LanguageManual/Joins ------------------------------------------------------------------------------ SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2) }}} there are two map/reduce jobs involved in computing the join. The first of these joins a with b and buffers the values of a while streaming the values of b in the reducers. The second of one of these jobs buffers the results of the first join while streaming the values of c through the reducers. + * Joins occur BEFORE WHERE CLAUSES. So, if you want to restrict the OUTPUT of a join, a requirement should be in the WHERE clause, otherwise it should be in the JOIN clause. A big point of confusion for this issue is partitioned tables: + {{{ + SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key) + WHERE a.ds='2009-07-07' AND b.ds='2009-07-07' + }}} + will join a on b, producing a list of a.val and b.val. The WHERE clause, however, can also reference other columns of a and b that are in the output of the join, and then filter them out. However, whenever a row from the JOIN has found a key for a and no key for b, all of the columns of b will be NULL, '''including the ds column'''. This is to say, you will filter out all rows of join output for which there was no valid b.key, and thus you have outsmarted your LEFT OUTER requirement. In other words, the LEFT OUTER part of the join is irrelevant if you reference any column of b in the WHERE clause. Instead, when OUTER JOINing, use this syntax: + {{{ + SELECT a.val, b.val FROM a LEFT OUTER JOIN b + ON (a.key=b.key AND b.ds='2009-07-07' AND a.ds='2009-07-07') + }}} + ...the result is that the output of the join is pre-filtered, and you won't get post-filtering trouble for rows that have a valid a.key but no matching b.key. The same logic applies to RIGHT and FULL joins.