drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Aman Sinha (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-2242) Wrong result (more rows) when outer query groups by subset of columns that inner query groups by
Date Tue, 17 Feb 2015 22:02:11 GMT

     [ https://issues.apache.org/jira/browse/DRILL-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Aman Sinha updated DRILL-2242:
------------------------------
    Assignee: Jinfeng Ni  (was: Aman Sinha)

> Wrong result (more rows) when outer query groups by subset of columns that inner query
groups by
> ------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-2242
>                 URL: https://issues.apache.org/jira/browse/DRILL-2242
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 0.7.0
>            Reporter: Aman Sinha
>            Assignee: Jinfeng Ni
>            Priority: Critical
>
> The following query has a subquery that groups on 2 columns and outer query that queries
on 1 of those columns.  With slice_target = 1 to force exchanges, it produces incorrect result:

> {code}
> alter session set `planner.slice_target` = 1;
> select count(*) from 
>  (select l_partksy from
>    (select l_partkey, l_suppkey from cp.`tpch/lineitem.parquet`
>       group by l_partkey, l_suppkey) 
>    group by l_partkey
>  );
> +------------+
> |   EXPR$0   |
> +------------+
> | 6227       |
> +------------+
> 1 row selected (1.522 seconds)
> {code}
> Correct result (from Postgres): 
> {code}
>  count
> -------
>   2000
> (1 row)
> {code}
> The cause appears to be related to distribution trait propagation.  Here's the EXPLAIN
plan: 
> {code}
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      StreamAgg(group=[{}], EXPR$0=[$SUM0($0)])
> 00-02        UnionExchange
> 01-01          StreamAgg(group=[{}], EXPR$0=[COUNT()])
> 01-02            Project($f0=[0])
> 01-03              HashAgg(group=[{0}])
> 01-04                Project(l_partkey=[$0])
> 01-05                  HashAgg(group=[{0, 1}])
> 01-06                    HashToRandomExchange(dist0=[[$0]], dist1=[[$1]])
> 02-01                      HashAgg(group=[{0, 1}])
> 02-02                        Project(l_partkey=[$1], l_suppkey=[$0])
> 02-03                          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
[path=/tpch/lineitem.parquet]], selectionRoot=/tpch/lineitem.parquet, numFiles=1, columns=[`l_partkey`,
`l_suppkey`]]])
> {code}
> Note that the HashExchange operator 06 does a distribute on 2 columns l_partkey and l_suppkey
in order to perform the 2phase aggregation. These are the group-by columns.  However, in the
outer query's HashAgg, there is no re-distribution being done.  It assumes that data is already
hash distributed on l_partkey which is not correct. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message