drill-issues mailing list archives

From "Arina Ielchiieva (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-1691) ConvertCountToDirectScan rule should be applicable for 2 or more COUNT aggregates
Date Tue, 15 Aug 2017 15:08:01 GMT

     [ https://issues.apache.org/jira/browse/DRILL-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arina Ielchiieva updated DRILL-1691:
------------------------------------
    Fix Version/s:     (was: Future)
                   1.12.0

> ConvertCountToDirectScan rule should be applicable for 2 or more COUNT aggregates
> ---------------------------------------------------------------------------------
>
>                 Key: DRILL-1691
>                 URL: https://issues.apache.org/jira/browse/DRILL-1691
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>    Affects Versions: 0.6.0
>            Reporter: Aman Sinha
>            Assignee: Arina Ielchiieva
>             Fix For: 1.12.0
>
>
> The ConvertCountToDirectScan rule currently applies only when the query has a single
> COUNT(*) or COUNT(column) aggregate without a GROUP BY. The rule should be extended to
> cover multiple such aggregates: it depends on the underlying ParquetGroupScan providing
> the correct column value count, and retrieving that count for multiple columns should be
> fine. However, if even one such column does not have statistics, the rule should not be
> applied.
> Here's an example sequence:
> First, run a CTAS so that statistics are present for the
> table (the original Parquet data may not have stats):
> {code:sql}
> 0: jdbc:drill:zk=local> create table nation3 as select * from cp.`tpch/nation.parquet`;
> +------------+---------------------------+
> |  Fragment  | Number of records written |
> +------------+---------------------------+
> | 0_0        | 25                        |
> +------------+---------------------------+
> {code}
> The EXPLAIN output below shows that the count is retrieved directly from the Scan:
> {code:sql}
> 0: jdbc:drill:zk=local> explain plan for select count(n_regionkey) as x from nation3;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(x=[$0])
> 00-02        Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@5db6cb92])
> {code}
> The following query, which computes two COUNT aggregates, introduces a StreamAgg into
> the plan that is not needed:
> {code:sql}
> 0: jdbc:drill:zk=local> explain plan for select count(n_regionkey) as x, count(n_nationkey) as y from nation3;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(x=[$0], y=[$1])
> 00-02        StreamAgg(group=[{}], x=[COUNT($0)], y=[COUNT($1)])
> 00-03          Project(n_regionkey=[$1], n_nationkey=[$0])
> 00-04            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/nation3]], selectionRoot=/tmp/nation3, numFiles=1, columns=[`n_regionkey`, `n_nationkey`]]])
> {code}
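
The applicability condition described in the issue (several COUNT aggregates, no group-by, statistics required for every counted column) can be sketched roughly as follows. This is an illustrative Python model only, not Drill's implementation: the function name `counts_from_metadata`, the `STAR` marker, and the dictionary of per-column value counts are all hypothetical stand-ins for whatever the ParquetGroupScan metadata actually exposes, and the sample counts assume the 25-row table contains no nulls.

```python
from typing import Dict, List, Optional, Sequence

STAR = "*"  # hypothetical marker standing in for COUNT(*)


def counts_from_metadata(
    count_columns: Sequence[str],
    row_count: int,
    column_value_counts: Dict[str, Optional[int]],
) -> Optional[List[int]]:
    """Return one count per COUNT aggregate, or None if the rewrite must not apply.

    column_value_counts maps a column name to its value count taken from
    file statistics; None (or a missing key) means no statistics are available.
    """
    results: List[int] = []
    for column in count_columns:
        if column == STAR:
            # COUNT(*) only needs the total row count.
            results.append(row_count)
            continue
        value_count = column_value_counts.get(column)
        if value_count is None:
            # A single column without statistics disqualifies the whole rewrite.
            return None
        results.append(value_count)
    return results


# Illustrative metadata only: 25 rows as in the CTAS output, no nulls assumed.
stats = {"n_regionkey": 25, "n_nationkey": 25, "n_comment": None}

print(counts_from_metadata(["n_regionkey", "n_nationkey"], 25, stats))  # [25, 25]
print(counts_from_metadata(["n_regionkey", "n_comment"], 25, stats))    # None
```

Returning `None` for the whole query, rather than answering some aggregates from metadata and others from a scan, mirrors the all-or-nothing condition in the description: when any counted column lacks statistics, the planner would keep the regular aggregate plan.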



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
