drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Aman Sinha (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-3352) Extra re-distribution when evaluating window function after GROUP BY
Date Wed, 24 Jun 2015 06:11:43 GMT

     [ https://issues.apache.org/jira/browse/DRILL-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Aman Sinha updated DRILL-3352:
------------------------------
    Labels: window_function  (was: )

> Extra re-distribution when evaluating window function after GROUP BY
> --------------------------------------------------------------------
>
>                 Key: DRILL-3352
>                 URL: https://issues.apache.org/jira/browse/DRILL-3352
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.0.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>              Labels: window_function
>             Fix For: 1.2.0
>
>
> Consider the following query and plan: 
> {code}
> explain plan for select min(l_partkey) over (partition by l_suppkey) from lineitem group
by l_partkey, l_suppkey limit 1;
> 00-00    Screen
> 00-01      Project(EXPR$0=[$0])
> 00-02        SelectionVectorRemover
> 00-03          Limit(fetch=[1])
> 00-04            UnionExchange
> 01-01              Project($0=[$3])
> 01-02                Window(window#0=[window(partition {1} order by [] range between
UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING aggs [MIN($0)])])
> 01-03                  SelectionVectorRemover
> 01-04                    Sort(sort0=[$1], dir0=[ASC])
> 01-05                      Project(l_partkey=[$0], l_suppkey=[$1], $f2=[$2])
> 01-06                        HashToRandomExchange(dist0=[[$1]])
> 02-01                          UnorderedMuxExchange
> 03-01                            Project(l_partkey=[$0], l_suppkey=[$1], $f2=[$2], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($1))])
> 03-02                              HashAgg(group=[{0, 1}], agg#0=[MIN($2)])
> 03-03                                Project(l_partkey=[$0], l_suppkey=[$1], $f2=[$2])
> 03-04                                  HashToRandomExchange(dist0=[[$0]], dist1=[[$1]])
> 04-01                                    UnorderedMuxExchange
> 05-01                                      Project(l_partkey=[$0], l_suppkey=[$1], $f2=[$2],
E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($1, hash64AsDouble($0)))])
> 05-02                                        HashAgg(group=[{0, 1}], agg#0=[MIN($0)])
> 05-03                                          Project(l_partkey=[$1], l_suppkey=[$0])
> 05-04                                            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
[path=file:/Users/asinha/data/tpch-sf1/lineitem]], selectionRoot=/Users/asinha/data/tpch-sf1/lineitem,
numFiles=1, columns=[`l_partkey`, `l_suppkey`]]])
> {code}
> Here, we do a distribution for the HashAgg on 2 columns: {l_partkey, l_suppkey}.  Subsequently,
we re-distribute on {l_suppkey} only since the window function has a partition-by l_suppkey.
 The second re-distribute could be avoided if the first distribution for the HashAgg was done
on l_suppkey only.   The reason we do distribution on all grouping columns is to avoid skew
problems.   However, in many cases especially when a window function is involved, it may make
sense to only distribute on 1 column. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message