drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Khurram Faraaz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-3648) NTILE function returns incorrect results
Date Fri, 14 Aug 2015 17:36:45 GMT

    [ https://issues.apache.org/jira/browse/DRILL-3648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697432#comment-14697432
] 

Khurram Faraaz commented on DRILL-3648:
---------------------------------------

Here is the repro with a smaller dataset.

Results returned by Drill.

{code}
0: jdbc:drill:schema=dfs.tmp> select col7 , ntile(5) over(partition by col7 order by col0)
NTILE_col7 from FEWRWSPQQ_101;
+--------+-------------+
|  col7  | NTILE_col7  |
+--------+-------------+
| false  | 1           |
| false  | 2           |
| false  | 3           |
| false  | 4           |
| false  | 5           |
| false  | 5           |
| false  | 5           |
| false  | 5           |
| false  | 5           |
| false  | 5           |
| false  | 5           |
| true   | 1           |
| true   | 2           |
| true   | 3           |
| true   | 4           |
| true   | 5           |
| true   | 5           |
| true   | 5           |
| true   | 5           |
| true   | 5           |
| true   | 5           |
| true   | 5           |
+--------+-------------+
22 rows selected (0.248 seconds)
{code}

Results returned by Postgres

{code}
postgres=# select col7 , ntile(5) over(partition by col7 order by col0) NTILE_col7 from FEWRWSPQQ_101;
 col7 | ntile_col7 
------+------------
 f    |          1
 f    |          1
 f    |          1
 f    |          2
 f    |          2
 f    |          3
 f    |          3
 f    |          4
 f    |          4
 f    |          5
 f    |          5
 t    |          1
 t    |          1
 t    |          1
 t    |          2
 t    |          2
 t    |          3
 t    |          3
 t    |          4
 t    |          4
 t    |          5
 t    |          5
(22 rows)
{code}

Interesting observation is that, when I use order by 1 in the window definition we see correct
results.

{code}
0: jdbc:drill:schema=dfs.tmp> select col7 , ntile(5) over(partition by col7 order by 1)
NTILE_col7 from FEWRWSPQQ_101;
+--------+-------------+
|  col7  | NTILE_col7  |
+--------+-------------+
| false  | 1           |
| false  | 1           |
| false  | 1           |
| false  | 2           |
| false  | 2           |
| false  | 3           |
| false  | 3           |
| false  | 4           |
| false  | 4           |
| false  | 5           |
| false  | 5           |
| true   | 1           |
| true   | 1           |
| true   | 1           |
| true   | 2           |
| true   | 2           |
| true   | 3           |
| true   | 3           |
| true   | 4           |
| true   | 4           |
| true   | 5           |
| true   | 5           |
+--------+-------------+
22 rows selected (0.277 seconds)
{code}

> NTILE function returns incorrect results
> ----------------------------------------
>
>                 Key: DRILL-3648
>                 URL: https://issues.apache.org/jira/browse/DRILL-3648
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.2.0
>         Environment: private-branch https://github.com/adeneche/incubator-drill/tree/new-window-funcs
>            Reporter: Khurram Faraaz
>            Assignee: Deneche A. Hakim
>            Priority: Critical
>              Labels: window_function
>             Fix For: 1.2.0
>
>
> NTILE function returns incorrect results for larger dataset. I am working on reproducing
the problem with a smaller dataset.
> The inner query that uses NTILE should have divided the rows into two sets (tiles)  where
each tile consists of (937088 + 1 ) rows , 937088 rows
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select ntile_key2, count(ntile_key2) from (select ntile(2)
over(partition by key2 order by key1) ntile_key2 from `twoKeyJsn.json` where key2 = 'm') group
by ntile_key2;
> +-------------+----------+
> | ntile_key2  |  EXPR$1  |
> +-------------+----------+
> | 1           | 1        |
> | 2           | 1874176  |
> +-------------+----------+
> 2 rows selected (49.406 seconds)
> {code}
> Explain plan for  inner query that returns wrong results.
> {code}
> 0: jdbc:drill:schema=dfs.tmp> explain plan for select ntile(2) over(partition by key2
order by key1) from `twoKeyJsn.json` where key2 = 'm';
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      UnionExchange
> 01-01        Project(EXPR$0=[$0])
> 01-02          Project($0=[$2])
> 01-03            Window(window#0=[window(partition {0} order by [1] range between UNBOUNDED
PRECEDING and CURRENT ROW aggs [NTILE($2)])])
> 01-04              SelectionVectorRemover
> 01-05                Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
> 01-06                  Project(key2=[$0], key1=[$1])
> 01-07                    HashToRandomExchange(dist0=[[$0]])
> 02-01                      UnorderedMuxExchange
> 03-01                        Project(key2=[$0], key1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02                          SelectionVectorRemover
> 03-03                            Filter(condition=[=($0, 'm')])
> 03-04                              Scan(groupscan=[EasyGroupScan [selectionRoot=maprfs:/tmp/twoKeyJsn.json,
numFiles=1, columns=[`key2`, `key1`], files=[maprfs:///tmp/twoKeyJsn.json]]])
> {code}
> Total number of rows in partition that has key2 = 'm'
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select count(key1) from `twoKeyJsn.json` where key2
= 'm';
> +----------+
> |  EXPR$0  |
> +----------+
> | 1874177  |
> +----------+
> 1 row selected (37.581 seconds)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message