Return-Path: Delivered-To: apmail-hadoop-core-commits-archive@www.apache.org Received: (qmail 4725 invoked from network); 21 Jan 2009 23:04:27 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 21 Jan 2009 23:04:27 -0000 Received: (qmail 38993 invoked by uid 500); 21 Jan 2009 23:04:26 -0000 Delivered-To: apmail-hadoop-core-commits-archive@hadoop.apache.org Received: (qmail 38966 invoked by uid 500); 21 Jan 2009 23:04:26 -0000 Mailing-List: contact core-commits-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: core-dev@hadoop.apache.org Delivered-To: mailing list core-commits@hadoop.apache.org Received: (qmail 38957 invoked by uid 99); 21 Jan 2009 23:04:26 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 21 Jan 2009 15:04:26 -0800 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.130] (HELO eos.apache.org) (140.211.11.130) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 21 Jan 2009 23:04:25 +0000 Received: from eos.apache.org (localhost [127.0.0.1]) by eos.apache.org (Postfix) with ESMTP id F179D118DD for ; Wed, 21 Jan 2009 23:04:04 +0000 (GMT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit From: Apache Wiki To: core-commits@hadoop.apache.org Date: Wed, 21 Jan 2009 23:04:04 -0000 Message-ID: <20090121230404.1979.28521@eos.apache.org> Subject: [Hadoop Wiki] Update of "Hive/LanguageManual/LanguageManual/Joins" by AshishThusoo X-Virus-Checked: Checked by ClamAV on apache.org Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification. The following page has been changed by AshishThusoo: http://wiki.apache.org/hadoop/Hive/LanguageManual/LanguageManual/Joins ------------------------------------------------------------------------------ + == Join Syntax == + Hive supports the following syntax for joining tables: + {{{ join_table: table_reference JOIN table_factor [join_condition] @@ -21, +24 @@ expression = expression }}} + Only equality joins and outer joins are supported in Hive. Hive does not support join conditions that are not equality + conditions as it is very difficult to express such conditions as a map/reduce job. Also, more than two tables can be + joined in Hive. + + Some salient points to consider when writing join queries are as follows: + + * Only equality joins are allowed e.g. + {{{ + SELECT a.* FROM a JOIN b ON (a.id = b.id) + }}} + is a valid join, however + {{{ + SELECT a.* FROM a JOIN b ON (a.id <> b.id) + }}} + is NOT allowed + * More than 2 tables can be joined in the same query e.g. + {{{ + SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2) + }}} + is a valid join + * Hive converts joins over multiple tables into a single map/reduce job if for every table the same column is used in the join clauses e.g. + {{{ + SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1) + }}} + is converted into a single map/reduce job as only key1 column for b is involved in the join. On the other hand + {{{ + SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2) + }}} + is converted into two map/reduce jobs because key1 column from b is used in the first join condition and key2 column from b is used in the second one. The first map/reduce job joins a with b and the results are then joined with c in the second map/reduce job. + * In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers where as the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest tables appear last in the sequence. e.g. in + {{{ + SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1) + }}} + all the three tables are joined in a single map/reduce job and the values for a particular value of the key for tables a and b are buffered in the memory in the reducers. Then for each row retrieved from c, the join is computed with the buffered rows. Similarly for + {{{ + SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2) + }}} + there are two map/reduce jobs involved in computing the join. The first of these joins a with b and buffers the values of a while streaming the values of b in the reducers. The second of one of these jobs buffers the results of the first join while streaming the values of c through the reducers. +