Return-Path: X-Original-To: apmail-hive-dev-archive@www.apache.org Delivered-To: apmail-hive-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id C34CF7EFC for ; Thu, 1 Sep 2011 00:11:37 +0000 (UTC) Received: (qmail 92268 invoked by uid 500); 1 Sep 2011 00:11:37 -0000 Delivered-To: apmail-hive-dev-archive@hive.apache.org Received: (qmail 92123 invoked by uid 500); 1 Sep 2011 00:11:36 -0000 Mailing-List: contact dev-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hive.apache.org Delivered-To: mailing list dev@hive.apache.org Received: (qmail 92115 invoked by uid 500); 1 Sep 2011 00:11:36 -0000 Delivered-To: apmail-hadoop-hive-dev@hadoop.apache.org Received: (qmail 92112 invoked by uid 99); 1 Sep 2011 00:11:36 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 01 Sep 2011 00:11:36 +0000 X-ASF-Spam-Status: No, hits=-2000.5 required=5.0 tests=ALL_TRUSTED,RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 01 Sep 2011 00:11:32 +0000 Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116]) by hel.zones.apache.org (Postfix) with ESMTP id 829E744BA7 for ; Thu, 1 Sep 2011 00:11:10 +0000 (UTC) Date: Thu, 1 Sep 2011 00:11:10 +0000 (UTC) From: "Charles Chen (JIRA)" To: hive-dev@hadoop.apache.org Message-ID: <339829560.5167.1314835870530.JavaMail.tomcat@hel.zones.apache.org> In-Reply-To: <483778526.3520.1312334187341.JavaMail.tomcat@hel.zones.apache.org> Subject: [jira] [Commented] (HIVE-2337) Predicate pushdown erroneously conservative with outer joins MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/HIVE-2337?page=3Dcom.atlassian.= jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D13095= 015#comment-13095015 ]=20 Charles Chen commented on HIVE-2337: ------------------------------------ I've fixed the test output--it seems to be an improvement. > Predicate pushdown erroneously conservative with outer joins > ------------------------------------------------------------ > > Key: HIVE-2337 > URL: https://issues.apache.org/jira/browse/HIVE-2337 > Project: Hive > Issue Type: Bug > Components: Query Processor > Reporter: Charles Chen > Assignee: Charles Chen > Attachments: HIVE-2337v1.patch, HIVE-2337v2.patch, HIVE-2337v3.pa= tch, HIVE-2337v4.patch > > > The predicate pushdown filter is not applying left associativity of joins= correctly in determining possible aliases for pushing predicates. > In hive.ql.ppd.OpProcFactory.JoinPPD.getQualifiedAliases, the criteria fo= r pushing aliases is specified as: > {noformat} > /** > * Figures out the aliases for whom it is safe to push predicates bas= ed on > * ANSI SQL semantics For inner join, all predicates for all aliases = can be > * pushed For full outer join, none of the predicates can be pushed a= s that > * would limit the number of rows for join For left outer join, all t= he > * predicates on the left side aliases can be pushed up For right out= er > * join, all the predicates on the right side aliases can be pushed u= p Joins > * chain containing both left and right outer joins are treated as fu= ll > * outer join. [...] > * > * @param op > * Join Operator > * @param rr > * Row resolver > * @return set of qualified aliases > */ > {noformat} > Since hive joins are left associative, something like "a RIGHT OUTER JOIN= b LEFT OUTER JOIN c=C2=A0INNER JOIN d" should be interpreted as "((a RIGHT= OUTER JOIN b) LEFT OUTER JOIN c) INNER JOIN d", so there would be cases wh= ere joins with both left and right outer joins can have aliases that can be= pushed. Here, aliases b and d are eligible to be pushed up while the curr= ent criteria provide that none are eligible. > Using: > {noformat} > create table t1 (id int, key string, value string); > create table t2 (id int, key string, value string); > create table t3 (id int, key string, value string); > create table t4 (id int, key string, value string); > {noformat} > For example, the query > {noformat} > explain select * from t1 full outer join t2 on t1.id=3Dt2.id join t3 on t= 2.id=3Dt3.id where t3.id=3D20;=20 > {noformat} > currently gives > {noformat} > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 is a root stage > STAGE PLANS: > Stage: Stage-1 > Map Reduce > Alias -> Map Operator Tree: > t1=20 > TableScan > alias: t1 > Reduce Output Operator > key expressions: > expr: id > type: int > sort order: + > Map-reduce partition columns: > expr: id > type: int > tag: 0 > value expressions: > expr: id > type: int > expr: key > type: string > expr: value > type: string > t2=20 > TableScan > alias: t2 > Reduce Output Operator > key expressions: > expr: id > type: int > sort order: + > Map-reduce partition columns: > expr: id > type: int > tag: 1 > value expressions: > expr: id > type: int > expr: key > type: string > expr: value > type: string > t3=20 > TableScan > alias: t3 > Reduce Output Operator > key expressions: > expr: id > type: int > sort order: + > Map-reduce partition columns: > expr: id > type: int > tag: 2 > value expressions: > expr: id > type: int > expr: key > type: string > expr: value > type: string > Reduce Operator Tree: > Join Operator > condition map: > Outer Join 0 to 1 > Inner Join 1 to 2 > condition expressions: > 0 {VALUE._col0} {VALUE._col1} {VALUE._col2} > 1 {VALUE._col0} {VALUE._col1} {VALUE._col2} > 2 {VALUE._col0} {VALUE._col1} {VALUE._col2} > handleSkewJoin: false > outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, _c= ol10, _col11, _col12 > Filter Operator > predicate: > expr: (_col10 =3D 20) > type: boolean > Select Operator > expressions: > expr: _col0 > type: int > expr: _col1 > type: string > expr: _col2 > type: string > expr: _col5 > type: int > expr: _col6 > type: string > expr: _col7 > type: string > expr: _col10 > type: int > expr: _col11 > type: string > expr: _col12 > type: string > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5= , _col6, _col7, _col8 > File Output Operator > compressed: false > GlobalTableId: 0 > table: > input format: org.apache.hadoop.mapred.TextInputForma= t > output format: org.apache.hadoop.hive.ql.io.HiveIgnor= eKeyTextOutputFormat > Stage: Stage-0 > Fetch Operator > limit: -1 > {noformat} > while the correct behavior is to push the filter "t3.id=3D20" down: > {noformat} > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 is a root stage > STAGE PLANS: > Stage: Stage-1 > Map Reduce > Alias -> Map Operator Tree: > t1=20 > TableScan > alias: t1 > Reduce Output Operator > key expressions: > expr: id > type: int > sort order: + > Map-reduce partition columns: > expr: id > type: int > tag: 0 > value expressions: > expr: id > type: int > expr: key > type: string > expr: value > type: string > t2=20 > TableScan > alias: t2 > Reduce Output Operator > key expressions: > expr: id > type: int > sort order: + > Map-reduce partition columns: > expr: id > type: int > tag: 1 > value expressions: > expr: id > type: int > expr: key > type: string > expr: value > type: string > t3=20 > TableScan > alias: t3 > Filter Operator > predicate: > expr: (id =3D 20) > type: boolean > Reduce Output Operator > key expressions: > expr: id > type: int > sort order: + > Map-reduce partition columns: > expr: id > type: int > tag: 2 > value expressions: > expr: id > type: int > expr: key > type: string > expr: value > type: string > Reduce Operator Tree: > Join Operator > condition map: > Outer Join 0 to 1 > Inner Join 1 to 2 > condition expressions: > 0 {VALUE._col0} {VALUE._col1} {VALUE._col2} > 1 {VALUE._col0} {VALUE._col1} {VALUE._col2} > 2 {VALUE._col0} {VALUE._col1} {VALUE._col2} > handleSkewJoin: false > outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, _c= ol10, _col11, _col12 > Select Operator > expressions: > expr: _col0 > type: int > expr: _col1 > type: string > expr: _col2 > type: string > expr: _col5 > type: int > expr: _col6 > type: string > expr: _col7 > type: string > expr: _col10 > type: int > expr: _col11 > type: string > expr: _col12 > type: string > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, = _col6, _col7, _col8 > File Output Operator > compressed: false > GlobalTableId: 0 > table: > input format: org.apache.hadoop.mapred.TextInputFormat > output format: org.apache.hadoop.hive.ql.io.HiveIgnoreK= eyTextOutputFormat > Stage: Stage-0 > Fetch Operator > limit: -1 > {noformat} > The current behavior is actually stranger than this: for a left outer joi= n (similarly for a right outer join), hive finds the leftmost alias referre= d to in the *predicates* of left outer joins and rejects any alias to the r= ight of it for pushdown. So in this query the filter "t2.id=3D20" pushed d= own: > {noformat} > explain select * from t1 join t2 on (t1.id=3Dt2.id) left outer join t3 on= (t2.id=3Dt3.id) where t2.id=3D20; > {noformat} > while it isn't here: > {noformat} > explain select * from t1 join t2 on (t1.id=3Dt2.id) left outer join t3 on= (t1.id=3Dt3.id) where t2.id=3D20; > {noformat} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira