Date: Thu, 2 Jul 2015 14:21:04 +0000 (UTC)
From: "Aman Sinha (JIRA)"
To: issues@drill.apache.org
Reply-To: dev@drill.apache.org
Subject: [jira] [Commented] (DRILL-2247) SUM with CASE statement on the column of the null producing side of left outer join returns wrong result

    [ https://issues.apache.org/jira/browse/DRILL-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612007#comment-14612007 ]

Aman Sinha commented on DRILL-2247:
-----------------------------------

There are a few components of this query that make it difficult to analyze. The first step would be to run the left outer join and the aggregate functions on an integer/bigint column instead of a floating-point one. You can still use the round() function, since the result of avg() or of the division would be floating point.
Doing joins on floating-point numbers is going to be imprecise (decimal is fine, when we support it).

> SUM with CASE statement on the column of the null producing side of left outer join returns wrong result
> --------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-2247
>                 URL: https://issues.apache.org/jira/browse/DRILL-2247
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 0.8.0
>            Reporter: Victoria Markman
>            Assignee: Jinfeng Ni
>             Fix For: 1.2.0
>
>         Attachments: x1.parquet, x2.parquet
>
>
> {code}
> 0: jdbc:drill:schema=dfs> select * from x1;
> +------------+
> |     c1     |
> +------------+
> | 1032.6516  |
> +------------+
> 1 row selected (0.093 seconds)
> 0: jdbc:drill:schema=dfs> select * from x2;
> +------------+
> |     c1     |
> +------------+
> | 1057.3117  |
> | 1090.8299  |
> +------------+
> 2 rows selected (0.085 seconds)
> {code}
> Correct result is NULL:
> {code}
> 0: jdbc:drill:schema=dfs> select
> . . . . . . . . . . . . >     sum(x2.c1) as simple_sum,
> . . . . . . . . . . . . >     sum(case when x2.c1 = 0 then 100 else round(x2.c1/12) end) as sum_with_case
> . . . . . . . . . . . . > from
> . . . . . . . . . . . . >     x1 left outer join x2 on x1.c1 = x2.c1
> . . . . . . . . . . . . > group by
> . . . . . . . . . . . . > x1.c1;
> +------------+---------------+
> | simple_sum | sum_with_case |
> +------------+---------------+
> | 0.0000     | null          |
> +------------+---------------+
> 1 row selected (0.173 seconds)
> {code}
> Query plan for correct result:
> {code}
> 00-01    Project(simple_sum=[$0], sum_with_case=[$1])
> 00-02    UnionExchange
> 01-01    Project(simple_sum=[$1], sum_with_case=[$2])
> 01-02    HashAgg(group=[{0}], simple_sum=[SUM($1)], sum_with_case=[SUM($2)])
> 01-03    Project(c1=[$0], c10=[$1], $f2=[CASE(=($1, 0), CAST(100):ANY NOT NULL, ROUND(/($1, 12)))])
> 01-04    HashJoin(condition=[=($0, $1)], joinType=[left])
> 01-06    HashToRandomExchange(dist0=[[$0]])
> 02-01    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/aggregation/x1]], selectionRoot=/aggregation/x1, numFiles=1, columns=[`c1`]]])
> 01-05    Project(c10=[$0])
> 01-07    HashToRandomExchange(dist0=[[$0]])
> 03-01    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/aggregation/x2]], selectionRoot=/aggregation/x2, numFiles=1, columns=[`c1`]]])
> {code}
> Wrong result: sum with case returns 0.0 instead of NULL
> {code}
> 0: jdbc:drill:schema=dfs> select
> . . . . . . . . . . . . >     count(*) as count_star,
> . . . . . . . . . . . . >     sum(x2.c1) as simle_sum,
> . . . . . . . . . . . . >     round(avg(x2.c1)) as round_avg_x2_c1,
> . . . . . . . . . . . . >     sum(case when x2.c1 = 0 then 100 else round(x2.c1/12) end) as sum_with_case
> . . . . . . . . . . . . > from
> . . . . . . . . . . . . >     x1 left outer join x2 on x1.c1 = x2.c1
> . . . . . . . . . . . . > group by
> . . . . . . . . . . . . > x1.c1;
> +------------+------------+-----------------+---------------+
> | count_star | simle_sum  | round_avg_x2_c1 | sum_with_case |
> +------------+------------+-----------------+---------------+
> | 1          | null       | null            | 0.0           |
> +------------+------------+-----------------+---------------+
> {code}
> Query plan for the wrong result query:
> {code}
> 00-01    Project(count_star=[$0], simle_sum=[$1], round_avg_x2_c1=[$2], sum_with_case=[$3])
> 00-02    UnionExchange
> 01-01    Project(count_star=[$1], simle_sum=[CASE(=($3, 0), null, $2)], round_avg_x2_c1=[ROUND(CAST(/(CastHigh(CASE(=($3, 0), null, $2)), $3)):ANY)], sum_with_case=[$4])
> 01-02    HashAgg(group=[{0}], count_star=[COUNT()], agg#1=[$SUM0($1)], agg#2=[COUNT($1)], agg#3=[$SUM0($2)])
> 01-03    Project(c1=[$0], c10=[$1], $f2=[CASE(=($1, 0), CAST(100):ANY NOT NULL, ROUND(/($1, 12)))])
> 01-04    HashJoin(condition=[=($0, $1)], joinType=[left])
> 01-06    HashToRandomExchange(dist0=[[$0]])
> 02-01    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/aggregation/x1]], selectionRoot=/aggregation/x1, numFiles=1, columns=[`c1`]]])
> 01-05    Project(c10=[$0])
> 01-07    HashToRandomExchange(dist0=[[$0]])
> 03-01    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/aggregation/x2]], selectionRoot=/aggregation/x2, numFiles=1, columns=[`c1`]]])
> {code}
> Cut/paste version of the query; comment out the round function and the query will return the correct result:
> {code:sql}
> select
>     count(*) as count_star,
>     sum(x2.c1) as simle_sum,
>     round(avg(x2.c1)) as round_avg_x2_c1,
>     sum(case when x2.c1 = 0 then 100 else round(x2.c1/12) end) as sum_with_case
> from
>     x1 left outer join x2 on x1.c1 = x2.c1
> group by
>     x1.c1;
> {code}
> Not sure how critical this bug is. Inconsistency is bad.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
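For reference, a sketch of the integer-column repro suggested in the comment above. The table names y1 and y2 are hypothetical bigint-typed versions of x1 and x2 (they do not exist in the attached data); the point is only that the join predicate then compares exact values, while round() still applies to the floating-point avg():

{code:sql}
-- Hypothetical repro over integer columns; y1/y2 are assumed bigint
-- copies of x1/x2, so y1.c1 = y2.c1 is an exact comparison.
select
    count(*) as count_star,
    sum(y2.c1) as simple_sum,
    round(avg(y2.c1)) as round_avg_y2_c1,
    sum(case when y2.c1 = 0 then 100 else round(y2.c1/12) end) as sum_with_case
from
    y1 left outer join y2 on y1.c1 = y2.c1
group by
    y1.c1;
{code}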