drill-issues mailing list archives

From "Laurent Goujon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4468) Aggregates over empty input might fail
Date Fri, 04 Mar 2016 18:05:40 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180277#comment-15180277 ]

Laurent Goujon commented on DRILL-4468:
---------------------------------------

I looked more into this with the help of [~jnadeau]. The issue is that ExpressionTreeMaterializer
tries to find the best function based on the argument types (here NullExpression.INSTANCE),
and the best match (on a cast cost basis) is the overload taking VARCHAR.

It sounds fishy to me that an invalid function can be considered at all; this might be
something that should be revisited.
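To illustrate the failure mode described above, here is a minimal sketch of cost-based overload resolution. The class, enum, and cost values are hypothetical (this is not Drill's actual ExpressionTreeMaterializer code); it only shows how an untyped NULL argument, which casts to every parameter type at equal cost, lets tie-breaking pick an overload such as the VARCHAR one:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of cast-cost-based function resolution.  Each
// overload advertises one parameter type; the resolver picks the
// overload with the lowest cast cost from the argument type.  An
// untyped NULL argument casts to everything at cost 0, so the winner
// is decided purely by tie-breaking (here: declaration order) -- which
// is how a VARCHAR-taking overload can win for an all-null input.
public class ResolverSketch {
    public enum Type { NULL, INT, BIGINT, VARCHAR }

    // Illustrative cast costs: free for an exact match or an untyped
    // NULL, cheap for a widening cast, expensive otherwise.
    public static int castCost(Type from, Type to) {
        if (from == Type.NULL) return 0;   // NULL casts to anything for free
        if (from == to) return 0;
        if (from == Type.INT && to == Type.BIGINT) return 1;
        return 100;
    }

    // Return the first overload with minimal cost (ties broken by order).
    public static Type resolve(Type arg, List<Type> overloads) {
        Type best = null;
        int bestCost = Integer.MAX_VALUE;
        for (Type candidate : overloads) {
            int cost = castCost(arg, candidate);
            if (cost < bestCost) {
                bestCost = cost;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Type> sumOverloads = Arrays.asList(Type.VARCHAR, Type.INT, Type.BIGINT);
        // A concrete INT argument resolves to the INT overload:
        System.out.println(resolve(Type.INT, sumOverloads));   // INT
        // An untyped NULL ties all overloads at cost 0, so the
        // first-declared (VARCHAR) overload wins:
        System.out.println(resolve(Type.NULL, sumOverloads));  // VARCHAR
    }
}
```

Under this model, filtering out overloads that are semantically invalid for the function (e.g. SUM over VARCHAR) before costing would avoid the bad pick, which is essentially the revisit suggested above.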

> Aggregates over empty input might fail
> --------------------------------------
>
>                 Key: DRILL-4468
>                 URL: https://issues.apache.org/jira/browse/DRILL-4468
>             Project: Apache Drill
>          Issue Type: Bug
>         Environment: Linux/OpenJDK 7
>            Reporter: Laurent Goujon
>
> Some aggregation queries over empty input might fail, depending on the column ordering.
> This query for example would fail:
> {noformat}
> select sum(int_col) col1, sum(bigint_col) col2 from cp.`employee.json` where 1 = 0
> org.apache.drill.common.exceptions.UserRemoteException: UNSUPPORTED_OPERATION ERROR: Only COUNT, MIN and MAX aggregate functions supported for VarChar type Fragment 0:0 [Error Id: dcef042c-1c53-40df-88b0-816d3cb109a7 on xxx:31010]
> {noformat}
> But this one would succeed:
> {noformat}
> select sum(bigint_col) col2, sum(int_col) col1 from cp.`employee.json` where 1 = 0
> null    null
> {noformat}
> The reason why only one query fails is DRILL-4467. As a consequence, the two plans are significantly different, and don't behave quite the same way.
> Here's the Physical plan for the first query:
> {noformat}
> 00-00    Screen : rowType = RecordType(ANY col1, ANY col2): rowcount = 1.0, cumulative cost = {464.1 rows, 950.1 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 339
> 00-01      Project(col1=[$0], col2=[$1]) : rowType = RecordType(ANY col1, ANY col2): rowcount = 1.0, cumulative cost = {464.0 rows, 950.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 338
> 00-02        StreamAgg(group=[{}], col1=[SUM($0)], col2=[SUM($1)]) : rowType = RecordType(ANY col1, ANY col2): rowcount = 1.0, cumulative cost = {464.0 rows, 950.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 337
> 00-03          Limit(offset=[0], fetch=[0]) : rowType = RecordType(ANY int_col, ANY bigint_col): rowcount = 1.0, cumulative cost = {463.0 rows, 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 336
> 00-04            Scan(groupscan=[EasyGroupScan [selectionRoot=classpath:/employee.json, numFiles=1, columns=[`int_col`, `bigint_col`], files=[classpath:/employee.json]]]) : rowType = RecordType(ANY int_col, ANY bigint_col): rowcount = 463.0, cumulative cost = {463.0 rows, 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 335
> {noformat} 
> and the physical plan for the second query:
> {noformat}
> 00-00    Screen : rowType = RecordType(ANY col2, ANY col1): rowcount = 1.0, cumulative cost = {464.1 rows, 950.1 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 775
> 00-01      Project(col2=[$0], col1=[$1]) : rowType = RecordType(ANY col2, ANY col1): rowcount = 1.0, cumulative cost = {464.0 rows, 950.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 774
> 00-02        StreamAgg(group=[{}], col2=[SUM($0)], col1=[SUM($1)]) : rowType = RecordType(ANY col2, ANY col1): rowcount = 1.0, cumulative cost = {464.0 rows, 950.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 773
> 00-03          Limit(offset=[0], fetch=[0]) : rowType = RecordType(ANY bigint_col, ANY int_col): rowcount = 1.0, cumulative cost = {463.0 rows, 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 772
> 00-04            Project(bigint_col=[$1], int_col=[$0]) : rowType = RecordType(ANY bigint_col, ANY int_col): rowcount = 463.0, cumulative cost = {463.0 rows, 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 771
> 00-05              Scan(groupscan=[EasyGroupScan [selectionRoot=classpath:/employee.json, numFiles=1, columns=[`bigint_col`, `int_col`], files=[classpath:/employee.json]]]) : rowType = RecordType(ANY int_col, ANY bigint_col): rowcount = 463.0, cumulative cost = {463.0 rows, 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 770
> {noformat}
> The extra projection just before the scan seems to hide the VARCHAR type of the columns, allowing the aggregation to succeed. On the other hand, the storage plugin supports column pushdown, so the projection is theoretically unnecessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
