Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Thu, 1 Sep 2011 00:11:10 +0000 (UTC)
From: "Charles Chen (JIRA)" <jira@apache.org>
To: hive-dev@hadoop.apache.org
Message-ID: 
 <339829560.5167.1314835870530.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <483778526.3520.1312334187341.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HIVE-2337) Predicate pushdown erroneously
 conservative with outer joins
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/HIVE-2337?page=3Dcom.atlassian.=
jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D13095=
015#comment-13095015 ]=20

Charles Chen commented on HIVE-2337:
------------------------------------

I've fixed the test output--it seems to be an improvement.

> Predicate pushdown erroneously conservative with outer joins
> ------------------------------------------------------------
>
>                 Key: HIVE-2337
>                 URL: https://issues.apache.org/jira/browse/HIVE-2337
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Charles Chen
>            Assignee: Charles Chen
>         Attachments: HIVE-2337v1.patch, HIVE-2337v2.patch, HIVE-2337v3.pa=
tch, HIVE-2337v4.patch
>
>
> The predicate pushdown filter is not applying left associativity of joins=
 correctly in determining possible aliases for pushing predicates.
> In hive.ql.ppd.OpProcFactory.JoinPPD.getQualifiedAliases, the criteria fo=
r pushing aliases is specified as:
> {noformat}
>     /**
>      * Figures out the aliases for whom it is safe to push predicates bas=
ed on
>      * ANSI SQL semantics For inner join, all predicates for all aliases =
can be
>      * pushed For full outer join, none of the predicates can be pushed a=
s that
>      * would limit the number of rows for join For left outer join, all t=
he
>      * predicates on the left side aliases can be pushed up For right out=
er
>      * join, all the predicates on the right side aliases can be pushed u=
p Joins
>      * chain containing both left and right outer joins are treated as fu=
ll
>      * outer join. [...]
>      *
>      * @param op
>      *          Join Operator
>      * @param rr
>      *          Row resolver
>      * @return set of qualified aliases
>      */
> {noformat}
> Since hive joins are left associative, something like "a RIGHT OUTER JOIN=
 b LEFT OUTER JOIN c=C2=A0INNER JOIN d" should be interpreted as "((a RIGHT=
 OUTER JOIN b) LEFT OUTER JOIN c) INNER JOIN d", so there would be cases wh=
ere joins with both left and right outer joins can have aliases that can be=
 pushed.  Here, aliases b and d are eligible to be pushed up while the curr=
ent criteria provide that none are eligible.
> Using:
> {noformat}
> create table t1 (id int, key string, value string);
> create table t2 (id int, key string, value string);
> create table t3 (id int, key string, value string);
> create table t4 (id int, key string, value string);
> {noformat}
> For example, the query
> {noformat}
> explain select * from t1 full outer join t2 on t1.id=3Dt2.id join t3 on t=
2.id=3Dt3.id where t3.id=3D20;=20
> {noformat}
> currently gives
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t1=20
>           TableScan
>             alias: t1
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 0
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>         t2=20
>           TableScan
>             alias: t2
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 1
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>         t3=20
>           TableScan
>             alias: t3
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 2
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>       Reduce Operator Tree:
>         Join Operator
>           condition map:
>                Outer Join 0 to 1
>                Inner Join 1 to 2
>           condition expressions:
>             0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>             1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>             2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>           handleSkewJoin: false
>           outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, _c=
ol10, _col11, _col12
>           Filter Operator
>             predicate:
>                 expr: (_col10 =3D 20)
>                 type: boolean
>             Select Operator
>               expressions:
>                     expr: _col0
>                     type: int
>                     expr: _col1
>                     type: string
>                     expr: _col2
>                     type: string
>                     expr: _col5
>                     type: int
>                     expr: _col6
>                     type: string
>                     expr: _col7
>                     type: string
>                     expr: _col10
>                     type: int
>                     expr: _col11
>                     type: string
>                     expr: _col12
>                     type: string
>               outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5=
, _col6, _col7, _col8
>               File Output Operator
>                 compressed: false
>                 GlobalTableId: 0
>                 table:
>                     input format: org.apache.hadoop.mapred.TextInputForma=
t
>                     output format: org.apache.hadoop.hive.ql.io.HiveIgnor=
eKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {noformat}
> while the correct behavior is to push the filter "t3.id=3D20" down:
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t1=20
>           TableScan
>             alias: t1
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 0
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>         t2=20
>           TableScan
>             alias: t2
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 1
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>         t3=20
>           TableScan
>             alias: t3
>             Filter Operator
>               predicate:
>                   expr: (id =3D 20)
>                   type: boolean
>               Reduce Output Operator
>                 key expressions:
>                       expr: id
>                       type: int
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: id
>                       type: int
>                 tag: 2
>                 value expressions:
>                       expr: id
>                       type: int
>                       expr: key
>                       type: string
>                       expr: value
>                       type: string
>       Reduce Operator Tree:
>         Join Operator
>           condition map:
>                Outer Join 0 to 1
>                Inner Join 1 to 2
>           condition expressions:
>             0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>             1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>             2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>           handleSkewJoin: false
>           outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, _c=
ol10, _col11, _col12
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: int
>                   expr: _col1
>                   type: string
>                   expr: _col2
>                   type: string
>                   expr: _col5
>                   type: int
>                   expr: _col6
>                   type: string
>                   expr: _col7
>                   type: string
>                   expr: _col10
>                   type: int
>                   expr: _col11
>                   type: string
>                   expr: _col12
>                   type: string
>             outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, =
_col6, _col7, _col8
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreK=
eyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {noformat}
> The current behavior is actually stranger than this: for a left outer joi=
n (similarly for a right outer join), hive finds the leftmost alias referre=
d to in the *predicates* of left outer joins and rejects any alias to the r=
ight of it for pushdown.  So in this query the filter "t2.id=3D20" pushed d=
own:
> {noformat}
> explain select * from t1 join t2 on (t1.id=3Dt2.id) left outer join t3 on=
 (t2.id=3Dt3.id) where t2.id=3D20;
> {noformat}
> while it isn't here:
> {noformat}
> explain select * from t1 join t2 on (t1.id=3Dt2.id) left outer join t3 on=
 (t1.id=3Dt3.id) where t2.id=3D20;
> {noformat}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira