impala-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "fengYu (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (IMPALA-5016) Missed opportunities for static partition pruning with COALESCE()
Date Tue, 13 Jun 2017 01:54:00 GMT

     [ https://issues.apache.org/jira/browse/IMPALA-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

fengYu resolved IMPALA-5016.
----------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.10.0

IMPALA-5016: Simplify COALESCE() in SimplifyConditionalsRule.

Simplify COALESCE by skipping leading nulls and applying the following
transformations:
COALESCE(null, a, b) -> COALESCE(a, b);
COALESCE(<literal>, a, b) -> <literal>, when literal is not NullLiteral;
COALESCE(<partition-slotref>, a, b) -> <partition-slotref>,
when the partition column does not contain NULL.

Testing:
added unit tests in ExprRewriteRulesTest

Change-Id: I0325a9e437261b15313bbdf2f22d83376a84b575
Reviewed-on: http://gerrit.cloudera.org:8080/7015
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins

> Missed opportunities for static partition pruning with COALESCE()
> -----------------------------------------------------------------
>
>                 Key: IMPALA-5016
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5016
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
>            Reporter: William Forson
>            Assignee: fengYu
>              Labels: performance
>             Fix For: Impala 2.10.0
>
>
> TBH I'm not sure how general this issue is, so rather than attempting to express it as
abstractly as possible, I'll just start with a specific example (apologies in advance for
formatting...horrible JIRA artist here):
> {code}
> # create two tables -- the main point here is that t1 has 2 partitions
> > create table t1 (i int) partitioned by (part int) ;
> > create table t2 (i int) partitioned by (part int) ;
> > insert into t1 (part, i) values (1, 1), (2, 2) ;
> > insert into t2 (part, i) values (1, 1), (2, 2) ;
> # query 1: coalesce t1 partition column with literal value — partition pruning kicks
in
> > explain with results as ( select coalesce(t1.i, t2.i) as i, coalesce(t1.part, 666)
as part from t1 full outer join t2 on (t1.i = t2.i) ) select * from results where part = 1;
> Query: explain with results as ( select coalesce(t1.i, t2.i) as i, coalesce(t1.part,
666) as part from t1 full outer join t2 on (t1.i = t2.i) ) select * from results where part
= 1
> +------------------------------------------------------------------------------------+
> | Explain String                                                                    
|
> +------------------------------------------------------------------------------------+
> <SNIP>
> | 00:SCAN HDFS [default.t1]                                                         
|
> |    partitions=1/2 files=1 size=2B                                                 
|
> +------------------------------------------------------------------------------------+
> # ^^^note # partitions scanned^^^
> # query 2: coalesce t1 partition column against dynamic value — partition pruning does
NOT kick in
> > explain with results as ( select coalesce(t1.i, t2.i) as i, coalesce(t1.part, t2.part)
as part from t1 full outer join t2 on (t1.i = t2.i) ) select * from results where part = 1;
> Query: explain with results as ( select coalesce(t1.i, t2.i) as i, coalesce(t1.part,
t2.part) as part from t1 full outer join t2 on (t1.i = t2.i) ) select * from results where
part = 1
> +------------------------------------------------------------------------------------+
> | Explain String                                                                    
|
> +------------------------------------------------------------------------------------+
> <SNIP>
> | 00:SCAN HDFS [default.t1]                                                         
|
> |    partitions=2/2 files=2 size=4B                                                 
|
> +------------------------------------------------------------------------------------+
> # ^^^note # partitions scanned^^^
> {code}
> In both of those queries, we're applying a filter predicate to a column that is defined
as follows:
> coalesce(t1.part, _)
> Since t1.part is a partition column in t1, every row in t1 has a non-null value for t1.part.
Furthermore, because t1.part appears as the first parameter to coalesce, every row in the
result set that contains _any_ data from t1 will get its "part" value from t1. Thus, all t1
data in the result set will be subject to the filter predicate in t1.part -- i.e. all t1 partitions
other than (part=1) could have been pruned during query planning.
> Question: what is wrong with the above reasoning? am I missing something fundamental/obvious
here?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message