drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Abhishek Girish <abhishek.gir...@gmail.com>
Subject Drill should validate column names within window functions
Date Tue, 23 Jun 2015 19:57:48 GMT
Hey all,

I observed an issue while working with Window Functions. I observed a case
where wrong results are returned from Drill.

In-case of weak schema such as parquet, Drill does not validate column
names. It is understandable when only part of the projection list in a
query. But when part of a Window Function, the results displayed are wrong,
and at times hard to identify the cause.

Two examples below:

> SELECT PERCENT_RANK() OVER (PARTITION BY s.store_sk, s.ss_customer_sk
ORDER BY s.store_sk, s.ss_customer_sk) FROM store_sales s LIMIT 2;
+---------+
| EXPR$0  |
+---------+
| 0.0     |
| 0.0     |
+---------+
2 rows selected (7.116 seconds)

SELECT CUME_DIST() OVER (PARTITION BY s.ss_store_sk ORDER BY s.ss_stoe_sk,
s.s_customr_sk) FROM store_sales s LIMIT 2;
+---------+
| EXPR$0  |
+---------+
| 1.0     |
| 1.0     |
+---------+
2 rows selected (8.361 seconds)

In both cases above, some columns do not exist.

With normal aggregate functions, it is similar to having a non-existent
column in projection list. Drill prints a column of null rows. This could
still be documented for users to expect "null" columns in results when
non-existent columns are part of a projection list.

> SELECT s.ss_store_sk, avg (ssdfd), ssdfd FROM store_sales s GROUP BY
s.ss_store_sk, ssdfd LIMIT 2;
+--------------+---------+--------+
| ss_store_sk  | EXPR$1  | ssdfd  |
+--------------+---------+--------+
| 10           | null    | null   |
| 4            | null    | null   |
+--------------+---------+--------+
2 rows selected (1.252 seconds)

But in case of window functions (and maybe other functions & expressions),
the results might look more real and hence difficult to identify that the
query had typos. Worse, users may trust the data returned from Drill, which
they shouldn't have.

Postgres:

# SELECT CUME_DIST() OVER (PARTITION BY s.ss_store_sk ORDER BY
s.ss_store_sk, s.ss_customer_sk) FROM store_sales s LIMIT 2;
      cume_dist
----------------------
 3.06415464350749e-05
 3.06415464350749e-05
(2 rows)

# SELECT PERCENT_RANK() OVER (PARTITION BY s.store_sk, s.ss_customer_sk
ORDER BY s.store_sk, s.ss_customer_sk) FROM store_sales s LIMIT 2;

ERROR:  column s.store_sk does not exist

LINE 1: ...ARTITION BY s.store_sk, s.ss_customer_sk ORDER BY s.store_sk...
                                                           ^

I think Drill at minimum should throw a warning message when it encounters
a non-existent column. And ideally queries must fail when non-existent
columns are part of any function/expression.

I'll file a JIRA if it is agreed to be an issue.

Regards,
Abhishek

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message