drill-issues mailing list archives

From "Deneche A. Hakim (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4469) SUM window query returns incorrect results over integer data
Date Mon, 07 Mar 2016 14:37:40 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183069#comment-15183069 ]

Deneche A. Hakim commented on DRILL-4469:
-----------------------------------------

Looking at the query plan:
{noformat}
00-00    Screen
00-01      Project(EXPR$0=[$0])
00-02        Project(w0$o0=[$2])
00-03          Window(window#0=[window(partition {0} order by [0] range between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING aggs [SUM($1)])])
00-04            SelectionVectorRemover
00-05              Sort(sort0=[$0], sort1=[$0], dir0=[ASC], dir1=[ASC])
00-06                Project(T1¦¦*=[$0], $1=[ITEM($0, 'c1')])
00-07                  Project(T1¦¦*=[$0])
00-08                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/Users/hakim/MapR/data/t_alltype.parquet]], selectionRoot=file:/Users/hakim/MapR/data/t_alltype.parquet, numFiles=1, usedMetadataFile=false, columns=[`*`]]])
{noformat}

Something's wrong: we only expand the 'c1' column and never expand the 'c8' column. Also,
although the sort uses two sort keys, they both reference the same column, 'T1¦¦*'.
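
To make the symptom concrete, here is a minimal Python sketch of SUM over a full-partition frame. It is an illustration only, not Drill's implementation: the function name, the sample rows and the constant partition key are all hypothetical, and the idea that the unexpanded star column behaves like a single partition key is an assumption, not something confirmed in this thread.

```python
from collections import defaultdict

def sum_over_full_partition(rows, partition_key, value_key):
    """Emulate SUM(v) OVER (PARTITION BY k ... RANGE BETWEEN
    UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING): every row
    receives the total of its own partition."""
    totals = defaultdict(int)
    for row in rows:
        totals[partition_key(row)] += value_key(row)
    return [totals[partition_key(row)] for row in rows]

# Hypothetical sample rows, not taken from t_alltype.
rows = [
    {"c1": 1, "c8": "a"},
    {"c1": 2, "c8": "a"},
    {"c1": 3, "c8": "b"},
    {"c1": 4, "c8": "b"},
]

# Partitioning on c8 gives one total per partition.
correct = sum_over_full_partition(rows, lambda r: r["c8"], lambda r: r["c1"])

# ASSUMPTION: if the partition key is the same for every row,
# all rows fall into one partition and get the grand total.
collapsed = sum_over_full_partition(rows, lambda r: "same", lambda r: r["c1"])

# The three partition totals visible in the Postgres output of the
# issue description add up to the single value Drill returns.
visible_postgres_totals = [4499, 5613, 473]
grand_total = sum(visible_postgres_totals)

print(correct)      # per-partition totals
print(collapsed)    # one repeated grand total
print(grand_total)
```

The last check is only suggestive, since part of the Postgres output is elided, but it is consistent with Drill treating all 145 rows as one partition.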

For instance, the following query gives correct results; you will find its plan right below:
{noformat}
SELECT SUM(c1) OVER(PARTITION BY c8 ORDER BY c1 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM (select * from `t_alltype.parquet`);
{noformat}

{noformat}
00-00    Screen
00-01      Project(EXPR$0=[$0])
00-02        Project($0=[$2])
00-03          Window(window#0=[window(partition {1} order by [0] range between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING aggs [SUM($0)])])
00-04            SelectionVectorRemover
00-05              Sort(sort0=[$1], sort1=[$0], dir0=[ASC], dir1=[ASC])
00-06                Project($0=[ITEM($0, 'c1')], $1=[ITEM($0, 'c8')])
00-07                  Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/Users/hakim/MapR/data/t_alltype.parquet]], selectionRoot=file:/Users/hakim/MapR/data/t_alltype.parquet, numFiles=1, usedMetadataFile=false, columns=[`*`]]])
{noformat}

> SUM window query returns incorrect results over integer data
> ------------------------------------------------------------
>
>                 Key: DRILL-4469
>                 URL: https://issues.apache.org/jira/browse/DRILL-4469
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.6.0
>         Environment: 4 node CentOS cluster
>            Reporter: Khurram Faraaz
>            Priority: Critical
>              Labels: window_function
>         Attachments: t_alltype.csv, t_alltype.parquet
>
>
> SUM window query returns incorrect results compared to Postgres, with or without the
> frame clause in the window definition. Note that there is a subquery involved, and the data in
> column c1 is sorted integer data with no nulls.
> Drill 1.6.0 commit ID: 6d5f4983
> Results from Drill 1.6.0
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> SELECT SUM(c1) OVER w FROM (select * from dfs.tmp.`t_alltype`) subQry WINDOW w AS (PARTITION BY c8 ORDER BY c1 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
> +---------+
> | EXPR$0  |
> +---------+
> | 10585   |
> | 10585   |
> | 10585   |
> | 10585   |
> | 10585   |
> | 10585   |
> ...
> | 10585  |
> | 10585  |
> | 10585  |
> +--------+
> 145 rows selected (0.257 seconds)
> {noformat}
> Results from Postgres 9.3
> {noformat}
> postgres=# SELECT SUM(c1) OVER w FROM (select * from t_alltype) subQry WINDOW w AS (PARTITION BY c8 ORDER BY c1 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
>  sum
> ------
>  4499
>  4499
>  4499
>  4499
>  4499
>  4499
> ...
>  5613
>  5613
>  5613
>   473
>   473
>   473
>   473
>   473
> (145 rows)
> {noformat}
> Removing the frame clause from the window definition still gives completely different
> results on Postgres vs. Drill.
> Results from Drill 1.6.0
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> SELECT SUM(c1) OVER w FROM (select * from t_alltype) subQry WINDOW w AS (PARTITION BY c8 ORDER BY c1);
> +---------+
> | EXPR$0  |
> +---------+
> | 10585   |
> | 10585   |
> | 10585   |
> | 10585   |
> | 10585   |
> | 10585   |
> | 10585   |
> | 10585   |
> | 10585   |
> ...
> | 10585  |
> | 10585  |
> | 10585  |
> | 10585  |
> | 10585  |
> +--------+
> 145 rows selected (0.28 seconds)
> {noformat}
> Results from Postgres
> {noformat}
> postgres=# SELECT SUM(c1) OVER w FROM (select * from t_alltype) subQry WINDOW w AS (PARTITION BY c8 ORDER BY c1);
>  sum
> ------
>     5
>    12
>    21
>    33
>    47
>    62
>    78
>    96
>   115
>   135
>   158
>   182
>   207
>   233
>   260
>   289
> ...
>  4914
>  5051
>  5189
>  5328
>  5470
>  5613
>     8
>    70
>   198
>   332
>   473
> (145 rows)
> {noformat}
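
For reference, the Postgres output without a frame clause is a running total because a window with ORDER BY and no explicit frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The sketch below emulates that default in plain Python. The function name and the sample rows are hypothetical and chosen only to show how peer rows (equal ORDER BY values) share one running sum; it is not how either engine implements the frame.

```python
from itertools import groupby

def running_sum_default_frame(rows, partition_key, order_key, value_key):
    """Emulate SUM(v) OVER (PARTITION BY p ORDER BY o) with the default
    frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
    Results are returned in (partition, order) sorted order."""
    ordered = sorted(rows, key=lambda r: (partition_key(r), order_key(r)))
    results = []
    for _, partition in groupby(ordered, key=partition_key):
        running = 0  # the running total restarts for every partition
        for _, peer_group in groupby(partition, key=order_key):
            peers = list(peer_group)
            # RANGE frames include all peers of the current row, so
            # rows with equal ORDER BY values get the same sum.
            running += sum(value_key(r) for r in peers)
            results.extend([running] * len(peers))
    return results

# Hypothetical (c8, c1) rows, not taken from t_alltype.
rows = [("a", 5), ("a", 7), ("a", 7), ("b", 8), ("b", 62)]

sums = running_sum_default_frame(
    rows,
    partition_key=lambda r: r[0],
    order_key=lambda r: r[1],
    value_key=lambda r: r[1],
)
print(sums)
```

With correct partitioning the running total restarts at each new c8 value, which matches the shape of the Postgres output above (the sum drops back to a small value when a new partition begins).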



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
