drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chun Chang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-2309) 'null' is counted with subquery
Date Wed, 25 Feb 2015 21:26:05 GMT

    [ https://issues.apache.org/jira/browse/DRILL-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337236#comment-14337236
] 

Chun Chang commented on DRILL-2309:
-----------------------------------

I think this issue might be because we reuse count value. If an aggregate function already
generated count, it's possible we reuse it as an optimization. But it does not take into consideration
that a column might contain nulls and null should not be counted.

{code}
0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select tt.gbyi, count(tt.nul), avg(tt.fl)
from (select t.id, t.gbyi, t.fl, t.nul from `complex.json` t) tt group by tt.gbyi order by
tt.gbyi;
+------------+------------+------------+
|    gbyi    |   EXPR$1   |   EXPR$2   |
+------------+------------+------------+
| 0          | 66943      | 499613.0956877819 |
| 1          | 66318      | 500760.0252919893 |
| 2          | 66994      | 498091.82200273 |
| 3          | 66683      | 498696.5063226428 |
| 4          | 66638      | 501125.64656145993 |
| 5          | 66439      | 499961.32710397616 |
| 6          | 66911      | 498875.3923256599 |
| 7          | 66666      | 501093.43067788356 |
| 8          | 66479      | 498458.1044031481 |
| 9          | 66643      | 499967.5392457864 |
| 10         | 66787      | 499190.47462408233 |
| 11         | 66863      | 502095.86682194035 |
| 12         | 66647      | 501708.8141502653 |
| 13         | 66290      | 498896.453904129 |
| 14         | 66699      | 501487.4206955959 |
+------------+------------+------------+
15 rows selected (14.17 seconds)
0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select tt.gbyi, sum(tt.id), count(tt.nul)
from (select t.id, t.gbyi, t.fl, t.nul from `complex.json` t) tt group by tt.gbyi order by
tt.gbyi;
+------------+------------+------------+
|    gbyi    |   EXPR$1   |   EXPR$2   |
+------------+------------+------------+
| 0          | 33445554017 | 33580      |
| 1          | 33209358334 | 33317      |
| 2          | 33369118041 | 33438      |
| 3          | 33254533860 | 33535      |
| 4          | 33393965595 | 33369      |
| 5          | 33216885506 | 32990      |
| 6          | 33380205950 | 33661      |
| 7          | 33405849390 | 33130      |
| 8          | 33136951190 | 33362      |
| 9          | 33319291474 | 33364      |
| 10         | 33339388887 | 33229      |
| 11         | 33571590550 | 33567      |
| 12         | 33437342090 | 33379      |
| 13         | 33071800925 | 33045      |
| 14         | 33448664191 | 33305      |
+------------+------------+------------+
15 rows selected (13.909 seconds)
{code}

> 'null' is counted with subquery
> -------------------------------
>
>                 Key: DRILL-2309
>                 URL: https://issues.apache.org/jira/browse/DRILL-2309
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Data Types
>    Affects Versions: 0.8.0
>            Reporter: Chun Chang
>            Assignee: Daniel Barclay (Drill)
>            Priority: Critical
>
> #Thu Feb 19 18:40:10 EST 2015
> git.commit.id.abbrev=1ceddff
> The following query returns correct count involving columns that contains null value.
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select tt.gbyi, count(tt.nul) from
(select t.id, t.gbyi, t.fl, t.nul from `complex.json` t) tt group by tt.gbyi order by tt.gbyi;
> +------------+------------+
> |    gbyi    |   EXPR$1   |
> +------------+------------+
> | 0          | 33580      |
> | 1          | 33317      |
> | 2          | 33438      |
> | 3          | 33535      |
> | 4          | 33369      |
> | 5          | 32990      |
> | 6          | 33661      |
> | 7          | 33130      |
> | 8          | 33362      |
> | 9          | 33364      |
> | 10         | 33229      |
> | 11         | 33567      |
> | 12         | 33379      |
> | 13         | 33045      |
> | 14         | 33305      |
> +------------+------------+
> {code}
> But if you add more aggregation to the query, the returned count is wrong (pay attention
to the last column). 
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select tt.gbyi, sum(tt.id), avg(tt.fl),
count(tt.nul) from (select t.id, t.gbyi, t.fl, t.nul from `complex.json` t) tt group by tt.gbyi
order by tt.gbyi;
> +------------+------------+------------+------------+
> |    gbyi    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
> +------------+------------+------------+------------+
> | 0          | 33445554017 | 499613.0956877819 | 66943      |
> | 1          | 33209358334 | 500760.0252919893 | 66318      |
> | 2          | 33369118041 | 498091.82200273 | 66994      |
> | 3          | 33254533860 | 498696.5063226428 | 66683      |
> | 4          | 33393965595 | 501125.64656145993 | 66638      |
> | 5          | 33216885506 | 499961.32710397616 | 66439      |
> | 6          | 33380205950 | 498875.3923256599 | 66911      |
> | 7          | 33405849390 | 501093.43067788356 | 66666      |
> | 8          | 33136951190 | 498458.1044031481 | 66479      |
> | 9          | 33319291474 | 499967.5392457864 | 66643      |
> | 10         | 33339388887 | 499190.47462408233 | 66787      |
> | 11         | 33571590550 | 502095.86682194035 | 66863      |
> | 12         | 33437342090 | 501708.8141502653 | 66647      |
> | 13         | 33071800925 | 498896.453904129 | 66290      |
> | 14         | 33448664191 | 501487.4206955959 | 66699      |
> +------------+------------+------------+------------+
> [code}
> plan for the query returned the wrong result:
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select tt.gbyi, sum(tt.id),
avg(tt.fl), count(tt.nul) from (select t.id, t.gbyi, t.fl, t.nul from `complex.json` t) tt
group by tt.gbyi order by tt.gbyi;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(gbyi=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
> 00-02        SingleMergeExchange(sort0=[0 ASC])
> 01-01          SelectionVectorRemover
> 01-02            Sort(sort0=[$0], dir0=[ASC])
> 01-03              Project(gbyi=[$0], EXPR$1=[CASE(=($2, 0), null, $1)], EXPR$2=[CAST(/(CastHigh(CASE(=($4,
0), null, $3)), $4)):ANY], EXPR$3=[$5])
> 01-04                HashAgg(group=[{0}], agg#0=[$SUM0($1)], agg#1=[$SUM0($2)], agg#2=[$SUM0($3)],
agg#3=[$SUM0($4)], EXPR$3=[$SUM0($5)])
> 01-05                  HashToRandomExchange(dist0=[[$0]])
> 02-01                    HashAgg(group=[{0}], agg#0=[$SUM0($1)], agg#1=[COUNT($1)], agg#2=[$SUM0($2)],
agg#3=[COUNT($2)], EXPR$3=[COUNT()])
> 02-02                      Project(gbyi=[$3], id=[$2], fl=[$1], nul=[$0])
> 02-03                        Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex.json,
numFiles=1, columns=[`gbyi`, `id`, `fl`, `nul`], files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message