Return-Path: X-Original-To: apmail-hive-dev-archive@www.apache.org Delivered-To: apmail-hive-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 4B19673D8 for ; Wed, 3 Aug 2011 21:15:51 +0000 (UTC) Received: (qmail 98409 invoked by uid 500); 3 Aug 2011 21:15:51 -0000 Delivered-To: apmail-hive-dev-archive@hive.apache.org Received: (qmail 98346 invoked by uid 500); 3 Aug 2011 21:15:49 -0000 Mailing-List: contact dev-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hive.apache.org Delivered-To: mailing list dev@hive.apache.org Received: (qmail 98338 invoked by uid 500); 3 Aug 2011 21:15:49 -0000 Delivered-To: apmail-hadoop-hive-dev@hadoop.apache.org Received: (qmail 98335 invoked by uid 99); 3 Aug 2011 21:15:49 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 03 Aug 2011 21:15:49 +0000 X-ASF-Spam-Status: No, hits=-2000.7 required=5.0 tests=ALL_TRUSTED,RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 03 Aug 2011 21:15:48 +0000 Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116]) by hel.zones.apache.org (Postfix) with ESMTP id 49A71A7CE2 for ; Wed, 3 Aug 2011 21:15:27 +0000 (UTC) Date: Wed, 3 Aug 2011 21:15:27 +0000 (UTC) From: "Charles Chen (JIRA)" To: hive-dev@hadoop.apache.org Message-ID: <2038974594.6032.1312406127297.JavaMail.tomcat@hel.zones.apache.org> In-Reply-To: <483778526.3520.1312334187341.JavaMail.tomcat@hel.zones.apache.org> Subject: [jira] [Updated] (HIVE-2337) Predicate pushdown erroneously conservative with outer joins MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HIVE-2337?page=3Dcom.atlassian= .jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Chen updated HIVE-2337: ------------------------------- Attachment: HIVE-2337v3.patch > Predicate pushdown erroneously conservative with outer joins > ------------------------------------------------------------ > > Key: HIVE-2337 > URL: https://issues.apache.org/jira/browse/HIVE-2337 > Project: Hive > Issue Type: Bug > Components: Query Processor > Reporter: Charles Chen > Assignee: Charles Chen > Attachments: HIVE-2337v1.patch, HIVE-2337v2.patch, HIVE-2337v3.pa= tch > > > The predicate pushdown filter is not applying left associativity of joins= correctly in determining possible aliases for pushing predicates. > In hive.ql.ppd.OpProcFactory.JoinPPD.getQualifiedAliases, the criteria fo= r pushing aliases is specified as: > {noformat} > /** > * Figures out the aliases for whom it is safe to push predicates bas= ed on > * ANSI SQL semantics For inner join, all predicates for all aliases = can be > * pushed For full outer join, none of the predicates can be pushed a= s that > * would limit the number of rows for join For left outer join, all t= he > * predicates on the left side aliases can be pushed up For right out= er > * join, all the predicates on the right side aliases can be pushed u= p Joins > * chain containing both left and right outer joins are treated as fu= ll > * outer join. [...] > * > * @param op > * Join Operator > * @param rr > * Row resolver > * @return set of qualified aliases > */ > {noformat} > Since hive joins are left associative, something like "a RIGHT OUTER JOIN= b LEFT OUTER JOIN c=C2=A0INNER JOIN d" should be interpreted as "((a RIGHT= OUTER JOIN b) LEFT OUTER JOIN c) INNER JOIN d", so there would be cases wh= ere joins with both left and right outer joins can have aliases that can be= pushed. Here, aliases b and d are eligible to be pushed up while the curr= ent criteria provide that none are eligible. > Using: > {noformat} > create table t1 (id int, key string, value string); > create table t2 (id int, key string, value string); > create table t3 (id int, key string, value string); > create table t4 (id int, key string, value string); > {noformat} > For example, the query > {noformat} > explain select * from t1 full outer join t2 on t1.id=3Dt2.id join t3 on t= 2.id=3Dt3.id where t3.id=3D20;=20 > {noformat} > currently gives > {noformat} > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 is a root stage > STAGE PLANS: > Stage: Stage-1 > Map Reduce > Alias -> Map Operator Tree: > t1=20 > TableScan > alias: t1 > Reduce Output Operator > key expressions: > expr: id > type: int > sort order: + > Map-reduce partition columns: > expr: id > type: int > tag: 0 > value expressions: > expr: id > type: int > expr: key > type: string > expr: value > type: string > t2=20 > TableScan > alias: t2 > Reduce Output Operator > key expressions: > expr: id > type: int > sort order: + > Map-reduce partition columns: > expr: id > type: int > tag: 1 > value expressions: > expr: id > type: int > expr: key > type: string > expr: value > type: string > t3=20 > TableScan > alias: t3 > Reduce Output Operator > key expressions: > expr: id > type: int > sort order: + > Map-reduce partition columns: > expr: id > type: int > tag: 2 > value expressions: > expr: id > type: int > expr: key > type: string > expr: value > type: string > Reduce Operator Tree: > Join Operator > condition map: > Outer Join 0 to 1 > Inner Join 1 to 2 > condition expressions: > 0 {VALUE._col0} {VALUE._col1} {VALUE._col2} > 1 {VALUE._col0} {VALUE._col1} {VALUE._col2} > 2 {VALUE._col0} {VALUE._col1} {VALUE._col2} > handleSkewJoin: false > outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, _c= ol10, _col11, _col12 > Filter Operator > predicate: > expr: (_col10 =3D 20) > type: boolean > Select Operator > expressions: > expr: _col0 > type: int > expr: _col1 > type: string > expr: _col2 > type: string > expr: _col5 > type: int > expr: _col6 > type: string > expr: _col7 > type: string > expr: _col10 > type: int > expr: _col11 > type: string > expr: _col12 > type: string > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5= , _col6, _col7, _col8 > File Output Operator > compressed: false > GlobalTableId: 0 > table: > input format: org.apache.hadoop.mapred.TextInputForma= t > output format: org.apache.hadoop.hive.ql.io.HiveIgnor= eKeyTextOutputFormat > Stage: Stage-0 > Fetch Operator > limit: -1 > {noformat} > while the correct behavior is to push the filter "t3.id=3D20" down: > {noformat} > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 is a root stage > STAGE PLANS: > Stage: Stage-1 > Map Reduce > Alias -> Map Operator Tree: > t1=20 > TableScan > alias: t1 > Reduce Output Operator > key expressions: > expr: id > type: int > sort order: + > Map-reduce partition columns: > expr: id > type: int > tag: 0 > value expressions: > expr: id > type: int > expr: key > type: string > expr: value > type: string > t2=20 > TableScan > alias: t2 > Reduce Output Operator > key expressions: > expr: id > type: int > sort order: + > Map-reduce partition columns: > expr: id > type: int > tag: 1 > value expressions: > expr: id > type: int > expr: key > type: string > expr: value > type: string > t3=20 > TableScan > alias: t3 > Filter Operator > predicate: > expr: (id =3D 20) > type: boolean > Reduce Output Operator > key expressions: > expr: id > type: int > sort order: + > Map-reduce partition columns: > expr: id > type: int > tag: 2 > value expressions: > expr: id > type: int > expr: key > type: string > expr: value > type: string > Reduce Operator Tree: > Join Operator > condition map: > Outer Join 0 to 1 > Inner Join 1 to 2 > condition expressions: > 0 {VALUE._col0} {VALUE._col1} {VALUE._col2} > 1 {VALUE._col0} {VALUE._col1} {VALUE._col2} > 2 {VALUE._col0} {VALUE._col1} {VALUE._col2} > handleSkewJoin: false > outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, _c= ol10, _col11, _col12 > Select Operator > expressions: > expr: _col0 > type: int > expr: _col1 > type: string > expr: _col2 > type: string > expr: _col5 > type: int > expr: _col6 > type: string > expr: _col7 > type: string > expr: _col10 > type: int > expr: _col11 > type: string > expr: _col12 > type: string > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, = _col6, _col7, _col8 > File Output Operator > compressed: false > GlobalTableId: 0 > table: > input format: org.apache.hadoop.mapred.TextInputFormat > output format: org.apache.hadoop.hive.ql.io.HiveIgnoreK= eyTextOutputFormat > Stage: Stage-0 > Fetch Operator > limit: -1 > {noformat} > The current behavior is actually stranger than this: for a left outer joi= n (similarly for a right outer join), hive finds the leftmost alias referre= d to in the *predicates* of left outer joins and rejects any alias to the r= ight of it for pushdown. So in this query the filter "t2.id=3D20" pushed d= own: > {noformat} > explain select * from t1 join t2 on (t1.id=3Dt2.id) left outer join t3 on= (t2.id=3Dt3.id) where t2.id=3D20; > {noformat} > while it isn't here: > {noformat} > explain select * from t1 join t2 on (t1.id=3Dt2.id) left outer join t3 on= (t1.id=3Dt3.id) where t2.id=3D20; > {noformat} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira