drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Deneche A. Hakim (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-3689) incorrect results : aggregate AVG returns wrong results over results returned by LEAD function.
Date Mon, 24 Aug 2015 15:38:45 GMT

    [ https://issues.apache.org/jira/browse/DRILL-3689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709463#comment-14709463
] 

Deneche A. Hakim commented on DRILL-3689:
-----------------------------------------

Looking at the output of the inner query:
{noformat}
select col7, col1, lead(col1) over(partition by col7 order by col1) leadcol1 from `3648.parquet`
order by col7, col1 limit 20;
+--------+----------------------+----------------------+
|  col7  |         col1         |       leadcol1       |
+--------+----------------------+----------------------+
| false  | -1                   | 1                    |
| false  | 1                    | 17                   |
| false  | 17                   | 30                   |
| false  | 30                   | 200                  |
| false  | 200                  | 1000                 |
| false  | 1000                 | 1001                 |
| false  | 1001                 | 5000                 |
| false  | 5000                 | 65534                |
| false  | 65534                | 4611686018427387903  |
| false  | 4611686018427387903  | 9223372036854775807  |
| false  | 9223372036854775807  | null                 |
| true   | -65535               | 0                    |
| true   | 0                    | 13                   |
| true   | 13                   | 23                   |
| true   | 23                   | 25                   |
| true   | 25                   | 197                  |
| true   | 197                  | 3000                 |
| true   | 3000                 | 9999999              |
| true   | 9999999              | 10000000             |
| true   | 10000000             | 92233720385475807    |
+--------+----------------------+----------------------+
{noformat}

Because col1 is stored as an INT64, trying to compute the average of {{lead(col1)}} will cause
an overflow. This explains why we have different results than Postgres.

[~khfaraaz] Can you confirm this by running the same query on a column with smaller numbers
?

> incorrect results : aggregate AVG returns wrong results over results returned by LEAD
function.
> -----------------------------------------------------------------------------------------------
>
>                 Key: DRILL-3689
>                 URL: https://issues.apache.org/jira/browse/DRILL-3689
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 1.2.0
>         Environment:  private-branch https://github.com/adeneche/incubator-drill/tree/new-window-funcs
>            Reporter: Khurram Faraaz
>            Assignee: Deneche A. Hakim
>            Priority: Critical
>              Labels: window_function
>             Fix For: 1.2.0
>
>
> Aggregate AVG returns wrong results over results returned by LEAD function.
> results returned by Drill
> {code}
> 0: jdbc:drill:schema=dfs.tmp> SELECT  avg(lead_col1) FROM (SELECT LEAD(col1) OVER(PARTITION
BY col7 ORDER BY col1) lead_col1 , col7 FROM FEWRWSPQQ_101) sub_query;
> +-------------------------+
> |         EXPR$0          |
> +-------------------------+
> | 2.35195986941647008E17  |
> +-------------------------+
> 1 row selected (0.264 seconds)
> {code}
> Explain plan for above query from Drill
> {code}
> 0: jdbc:drill:schema=dfs.tmp> explain plan for SELECT  avg(lead_col1) FROM (SELECT
LEAD(col1) OVER(PARTITION BY col7 ORDER BY col1) lead_col1 , col7 FROM FEWRWSPQQ_101) sub_query;
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(EXPR$0=[$0])
> 00-02        Project(EXPR$0=[CAST(/(CastHigh(CASE(=($1, 0), null, $0)), $1)):ANY NOT
NULL])
> 00-03          StreamAgg(group=[{}], agg#0=[$SUM0($0)], agg#1=[COUNT($0)])
> 00-04            Project(w0$o0=[$3])
> 00-05              Window(window#0=[window(partition {2} order by [1] range between UNBOUNDED
PRECEDING and CURRENT ROW aggs [LEAD($1)])])
> 00-06                SelectionVectorRemover
> 00-07                  Sort(sort0=[$2], sort1=[$1], dir0=[ASC], dir1=[ASC])
> 00-08                    Project(T36¦¦*=[$0], col1=[$1], col7=[$2])
> 00-09                      Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
[path=maprfs:///tmp/FEWRWSPQQ_101]], selectionRoot=maprfs:/tmp/FEWRWSPQQ_101, numFiles=1,
columns=[`*`]]])
> {code}
> results returned by Postgres
> {code}
> postgres=# SELECT  avg(lead_col1) FROM (SELECT LEAD(col1) OVER(PARTITION BY col7 ORDER
BY col1) lead_col1 , col7 FROM FEWRWSPQQ_101) sub_query;
>          avg         
> ---------------------
>  1157533190627124568
> (1 row)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message