drill-dev mailing list archives

From "Aman Sinha (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-1691) ConvertCountToDirectScan rule should be applicable for 2 or more COUNT aggregates
Date Wed, 12 Nov 2014 01:36:33 GMT
Aman Sinha created DRILL-1691:
---------------------------------

             Summary: ConvertCountToDirectScan rule should be applicable for 2 or more COUNT aggregates
                 Key: DRILL-1691
                 URL: https://issues.apache.org/jira/browse/DRILL-1691
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
    Affects Versions: 0.6.0
            Reporter: Aman Sinha
            Assignee: Aman Sinha


The ConvertCountToDirectScan rule currently applies only when the query has a single COUNT(*) or
COUNT(column) aggregate and no group-by. The rule should be extended to handle multiple such
aggregates: it relies on the underlying ParquetGroupScan to supply the correct column value count,
and retrieving that count for several columns should work the same way. However, if even one of
those columns lacks statistics, the rule should not be applied.
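
The applicability condition described above can be modeled in a few lines. The following is only a sketch of the decision logic, not Drill's planner code: the function name `direct_counts`, the `column_counts` mapping, and the use of the table row count for COUNT(*) are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the per-column value counts a Parquet group scan
# might expose. None means no statistics are available for that column.
ColumnCounts = Dict[str, Optional[int]]

STAR = "*"


def direct_counts(agg_columns: List[str], has_group_by: bool,
                  row_count: int,
                  column_counts: ColumnCounts) -> Optional[List[int]]:
    """Return one count per COUNT aggregate, or None if the rule must not fire."""
    # The rule only covers aggregates without group-by.
    if has_group_by or not agg_columns:
        return None

    results = []
    for col in agg_columns:
        if col == STAR:
            # Assumption: COUNT(*) is answered from the table row count.
            results.append(row_count)
            continue
        count = column_counts.get(col)
        if count is None:
            # Even one column without statistics disables the rule entirely.
            return None
        results.append(count)
    return results
```

With illustrative counts of 25 for both `n_regionkey` and `n_nationkey`, the call `direct_counts(["n_regionkey", "n_nationkey"], False, 25, {...})` yields one count per aggregate, while a single column mapped to `None` makes the whole call return `None`, so the planner would fall back to the regular aggregation plan.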


Here's an example sequence:

First, run a CTAS so that statistics are guaranteed to be present for the
table (the original Parquet data may not have stats):
{code:sql}
0: jdbc:drill:zk=local> create table nation3 as select * from cp.`tpch/nation.parquet`;
+------------+---------------------------+
|  Fragment  | Number of records written |
+------------+---------------------------+
| 0_0        | 25                        |
+------------+---------------------------+
{code}

The EXPLAIN output below shows that the count is retrieved directly from the Scan:
{code:sql}
0: jdbc:drill:zk=local> explain plan for select count(n_regionkey) as x from nation3;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(x=[$0])
00-02        Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@5db6cb92])
{code}

The following query, which computes 2 COUNT aggregates, introduces an unnecessary StreamAgg
into the plan:
{code:sql}
0: jdbc:drill:zk=local> explain plan for select count(n_regionkey) as x, count(n_nationkey) as y from nation3;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(x=[$0], y=[$1])
00-02        StreamAgg(group=[{}], x=[COUNT($0)], y=[COUNT($1)])
00-03          Project(n_regionkey=[$1], n_nationkey=[$0])
00-04            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/nation3]], selectionRoot=/tmp/nation3, numFiles=1, columns=[`n_regionkey`, `n_nationkey`]]])
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
