Mailing-List: contact issues-help@drill.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@drill.apache.org
Date: Sat, 10 Jan 2015 01:31:35 +0000 (UTC)
From: "Victoria Markman (JIRA)" <jira@apache.org>
To: issues@drill.apache.org
Message-ID: <JIRA.12766610.1420853459000.52749.1420853495735@Atlassian.JIRA>
In-Reply-To: <JIRA.12766610.1420853459000@Atlassian.JIRA>
References: <JIRA.12766610.1420853459000@Atlassian.JIRA>
 <JIRA.12766610.1420853459302@arcas>
Subject: [jira] [Created] (DRILL-1977) Wrong result with aggregation on top
 of UNION ALL operator
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Victoria Markman created DRILL-1977:
---------------------------------------

             Summary: Wrong result with aggregation on top of UNION ALL operator
                 Key: DRILL-1977
                 URL: https://issues.apache.org/jira/browse/DRILL-1977
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
    Affects Versions: 0.8.0
            Reporter: Victoria Markman
            Assignee: Jinfeng Ni


{code}
Jan 10 01:17:13 root@~ ] # hadoop fs -cat /test/t.json
{ "a1": 0, "b1": 0, "c1": "true",   "d1" : "2015-01-02"}
{ "a1": 0, "b1": 0, "c1": "false" , "d1" : "2015-01-03"}
{ "a1": 0, "b1": 0, "c1": "false" , "d1" : "2015-01-04"}
{ "a1": 1, "b1": 1, "c1": "true" ,  "d1" : "2015-01-05"}
{ "a1": 1, "b1": 1, "c1": "true" ,  "d1" : "2015-01-06"}
{code}

Query below returns wrong result, we should have 4 groups, instead we get only two groups from right side of the union.
Notice lack of project step above union all (unclear if it has anything to to with the bug)

{code}
0: jdbc:drill:schema=dfs> select calc1, max(a1), min(b1), count(c1) from (select a1+10 as calc1, a1, b1, c1 from `t.json` union all select a1+100 as calc1, a1, b1, c1 from `t.json`) group by calc1;
+------------+------------+------------+------------+
|   calc1    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
+------------+------------+------------+------------+
| 100        | 0          | 0          | 3          |
| 101        | 1          | 1          | 2          |
+------------+------------+------------+------------+
2 rows selected (0.181 seconds)

00-01      Project(calc1=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02        HashAgg(group=[{0}], EXPR$1=[MAX($1)], EXPR$2=[MIN($2)], EXPR$3=[COUNT($3)])
00-03          UnionAll(all=[true])
00-05            Project(calc1=[+($2, 10)], a1=[$2], b1=[$1], c1=[$0])
00-07              Scan(groupscan=[EasyGroupScan [selectionRoot=/test/t.json, numFiles=1, columns=[`a1`, `b1`, `c1`], files=[maprfs:/test/t.json]]])
00-04            Project(calc1=[+($2, 100)], a1=[$2], b1=[$1], c1=[$0])
00-06              Scan(groupscan=[EasyGroupScan [selectionRoot=/test/t.json, numFiles=1, columns=[`a1`, `b1`, `c1`], files=[maprfs:/test/t.json]]])
{code}

Correct result, when order of the columns is different in aggregate functions: "min(b1), max(a1)" instead of "max(a1), min(b1)"

{code}
0: jdbc:drill:schema=dfs> select calc1, min(b1), max(a1), count(c1) from (select a1+10 as calc1, a1, b1, c1 from `t.json` union all select a1+100 as calc1, a1, b1, c1 from `t.json`) group by calc1;
+------------+------------+------------+------------+
|   calc1    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
+------------+------------+------------+------------+
| 10         | 0          | 0          | 3          |
| 11         | 1          | 1          | 2          |
| 100        | 0          | 0          | 3          |
| 101        | 1          | 1          | 2          |
+------------+------------+------------+------------+
4 rows selected (0.256 seconds)

00-01      Project(calc1=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02        HashAgg(group=[{0}], EXPR$1=[MIN($1)], EXPR$2=[MAX($2)], EXPR$3=[COUNT($3)])
00-03          Project(calc1=[$0], b1=[$2], a1=[$1], c1=[$3])
00-04            UnionAll(all=[true])
00-06              Project(calc1=[+($2, 10)], a1=[$2], b1=[$1], c1=[$0])
00-08                Scan(groupscan=[EasyGroupScan [selectionRoot=/test/t.json, numFiles=1, columns=[`a1`, `b1`, `c1`], files=[maprfs:/test/t.json]]])
00-05              Project(calc1=[+($2, 100)], a1=[$2], b1=[$1], c1=[$0])
00-07                Scan(groupscan=[EasyGroupScan [selectionRoot=/test/t.json, numFiles=1, columns=[`a1`, `b1`, `c1`], files=[maprfs:/test/t.json]]]){code}

Non-union query works correctly:

{code}
0: jdbc:drill:schema=dfs> select a1+10 as calc1, min(a1), max(b1), count(c1) from `t.json` group by a1+10;
+------------+------------+------------+------------+
|   calc1    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
+------------+------------+------------+------------+
| 10         | 0          | 0          | 3          |
| 11         | 1          | 1          | 2          |
+------------+------------+------------+------------+
2 rows selected (0.142 seconds)
0: jdbc:drill:schema=dfs> select a1+10 as calc1, min(b1), max(a1), count(c1) from `t.json` group by a1+10;
+------------+------------+------------+------------+
|   calc1    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
+------------+------------+------------+------------+
| 10         | 0          | 0          | 3          |
| 11         | 1          | 1          | 2          |
+------------+------------+------------+------------+
2 rows selected (0.153 seconds)
{code}

If I switch legs of a UNION, we still get wrong result, but from the right side:

{code}
0: jdbc:drill:schema=dfs> select calc1, max(a1), min(b1), count(c1) from (select a1+100 as calc1, a1, b1, c1 from `t.json` union all select a1+10 as calc1, a1, b1, c1 from `t.json`) group by calc1;
+------------+------------+------------+------------+
|   calc1    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
+------------+------------+------------+------------+
| 10         | 0          | 0          | 3          |
| 11         | 1          | 1          | 2          |
+------------+------------+------------+------------+
2 rows selected (0.19 seconds)
{code}


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)