drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jinfeng Ni (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-1487) Drill window functions return wrong results
Date Fri, 03 Oct 2014 18:00:34 GMT

    [ https://issues.apache.org/jira/browse/DRILL-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158251#comment-14158251
] 

Jinfeng Ni commented on DRILL-1487:
-----------------------------------

Looks like the incorrect results issue is caused by missing the following rows that are equal
to the current row w.r.t the ORDER BY CLAUSE in the window frame. Here is the explanation
of the concept of window frame in posgresql: 
 
"There is another important concept associated with window functions: for each row, there
is a set of rows within its partition called its window frame. Many (but not all) window functions
act only on the rows of the window frame, rather than of the whole partition. By default,
if ORDER BY is supplied then the frame consists of all rows from the start of the partition
up through the current row, plus any following rows that are equal to the current row according
to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in
the partition."

If there is order by clause in the window definition, then, the window aggregate function
would be applied to the windows frame, which consists the row fro the start of the partition
up through the current row, plus any following rows that are equal to the current row w.r.t
ORDER BY clause.  In the original query, since ORDER BY is same as the PARTITION column, the
window aggregate function should be applied to all the rows in the partition, while the current
implementation seems to miss the rows of "that are equal to the current row according to the
ORDER BY clause.".

If re-modify the original query into the following query, we can clearly see that avg_sal
is computed over the start row to the current row, but ignore the following rows that are
equal to current row accord to the ORDER BY clause.  

{code}

SELECT employee_id,position_id, salary, avg(salary) OVER (PARTITION BY position_id order by
position_id) as avg_sal FROM cp.`employee.json` order by position_id limit 50;
+-------------+-------------+------------+------------+
| employee_id | position_id |   salary   |  avg_sal   |
+-------------+-------------+------------+------------+
| 1           | 1           | 80000.0    | 80000.0    |
| 5           | 2           | 35000.0    | 35000.0    |
| 2           | 2           | 40000.0    | 37500.0    |
| 4           | 2           | 40000.0    | 38333.333333333336 |
| 20          | 2           | 30000.0    | 36250.0    |
| 21          | 2           | 35000.0    | 36000.0    |
| 22          | 2           | 35000.0    | 35833.333333333336 |
| 6           | 3           | 25000.0    | 25000.0    |
| 7           | 4           | 15000.0    | 15000.0    |
| 10          | 5           | 50000.0    | 50000.0    |
| 40          | 6           | 10000.0    | 10000.0    |
| 39          | 6           | 10000.0    | 10000.0    |
| 37          | 6           | 6700.0     | 8900.0     |
| 38          | 6           | 8000.0     | 8675.0     |
| 42          | 7           | 5000.0     | 5000.0     |
| 36          | 7           | 45000.0    | 25000.0    |
| 41          | 7           | 8500.0     | 19500.0    |
| 43          | 8           | 6700.0     | 6700.0     |
| 44          | 8           | 5000.0     | 5850.0     |
| 47          | 9           | 6500.0     | 6500.0     |
| 48          | 9           | 7200.0     | 6850.0     |
| 49          | 9           | 5000.0     | 6233.333333333333 |
| 50          | 9           | 5000.0     | 5925.0     |
| 51          | 9           | 5000.0     | 5740.0     |
| 52          | 9           | 5000.0     | 5616.666666666667 |
| 45          | 9           | 6800.0     | 5785.714285714285 |
| 46          | 9           | 6600.0     | 5887.5     |

{code}

The fix seems to be to include the rows with equal ORDER BY columns in the window frame, when
compute window aggregate function.


> Drill window functions return wrong results
> -------------------------------------------
>
>                 Key: DRILL-1487
>                 URL: https://issues.apache.org/jira/browse/DRILL-1487
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Neeraja
>
> Executing the following window function with the requirement to see how a given employee
salary would compare to the avg(salary) in his/her position. 
> Query executes fine however returns wrong results(expect the avg(salary) to stay same
for a given window (i.e position id)
> 0: jdbc:drill:zk=local> SELECT employee_id,position_id, salary, avg(salary) OVER (PARTITION
BY position_id order by position_id) FROM cp.`employee.json` order by employee_id;
> +-------------+-------------+------------+------------+
> | employee_id | position_id |   salary   |   EXPR$3   |
> +-------------+-------------+------------+------------+
> | 1           | 1           | 80000.0    | 80000.0    |
> | 2           | 2           | 40000.0    | 37500.0    |
> | 4           | 2           | 40000.0    | 38333.333333333336 |
> | 5           | 2           | 35000.0    | 35000.0    |
> | 6           | 3           | 25000.0    | 25000.0    |
> | 7           | 4           | 15000.0    | 15000.0    |
> | 8           | 11          | 10000.0    | 14333.333333333334 |
> | 9           | 11          | 17000.0    | 17000.0    |



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message