drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4211) Column aliases not pushed down to JDBC stores in some cases when Drill expects aliased columns to be returned.
Date Fri, 11 Aug 2017 01:41:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122661#comment-16122661
] 

Paul Rogers commented on DRILL-4211:
------------------------------------

As we also discussed, Drill has a gap when handling ambiguous columns. At run time, Drill
retains only the column name with none of the additional information used at planning time.
That information includes:

_storage_plugin_ . _schema_ . _table_ . _column_

Consider the following queries:

{code}
SELECT `dfs.myDir`.`customers.csv`.col, `dfs.yourDir`.`customers.csv`.col
FROM `dfs.myDir`.`customers.csv`, `dfs.yourDir`.`customers.csv`
WHERE `dfs.myDir`.`customers.csv`.col = `dfs.yourDir`.`customers.csv`.col

SELECT `customers1.csv`.col, `customers2.csv`.col
FROM `customers1.csv`, `customers2.csv`
WHERE `customers1.csv`.col = `customers2.csv`.col

SELECT `dfs.myDir`.`customers.csv`.col, *
WHERE col = 10
{code}

(This assumes that Drill allows fully-qualfied names in the WHERE clause...)

In each case, the columns are unambiguous to a human. In the first, the fully qualified path
is unique. In the second, the works space is implied, but the table.column name is unique.
In the third, "col" will appear twice, but it is clear that the "col" in the WHERE clause
refers to the explicit "col", even if the wildcard expands to "col, foo, bar".

At run time, however, Drill uses only the tail column name. In fact, another JIRA asks that
dots in column names be treated as part of the name. (This allows a JSON field name of the
form "customers.col"...)

The most general solution would be to modify Drill's runtime naming system to provide fully
qualified names:

* The full plugin, workspace, table, column name, represented as a {{SchemaPath}}.
* A "hint" in a name that says that it came from a wildcard expansion rather than explicit.
* When names are ambiguous (and so we want to add a "0" to the name, prefer to add the suffix
to names that come from wildcard expansion.

There are probably other rules needed to ensure that all semantically valid references in
a query are translated into unique references at run time.

Note that, because Drill is schema-on-read, some of the name resolution normally done by the
planner (which considers fully qualified names) must be pushed into the execution framework
in Drill.

> Column aliases not pushed down to JDBC stores in some cases when Drill expects aliased
columns to be returned.
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-4211
>                 URL: https://issues.apache.org/jira/browse/DRILL-4211
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 1.3.0, 1.11.0
>         Environment: Postgres db storage
>            Reporter: Robert Hamilton-Smith
>            Assignee: Timothy Farkas
>              Labels: newbie
>
> When making an sql statement that incorporates a join to a table and then a self join
to that table to get a parent value , Drill brings back inconsistent results. 
> Here is the sql in postgres with correct output:
> {code:sql}
> select trx.categoryguid,
> cat.categoryname, w1.categoryname as parentcat
> from transactions trx
> join categories cat on (cat.CATEGORYGUID = trx.CATEGORYGUID)
> join categories w1 on (cat.categoryparentguid = w1.categoryguid)
> where cat.categoryparentguid IS NOT NULL;
> {code}
> Output:
> ||categoryid||categoryname||parentcategory||
> |id1|restaurants|food&Dining|
> |id1|restaurants|food&Dining|
> |id2|Coffee Shops|food&Dining|
> |id2|Coffee Shops|food&Dining|
> When run in Drill with correct storage prefix:
> {code:sql}
> select trx.categoryguid,
> cat.categoryname, w1.categoryname as parentcat
> from db.schema.transactions trx
> join db.schema.categories cat on (cat.CATEGORYGUID = trx.CATEGORYGUID)
> join db.schema.wpfm_categories w1 on (cat.categoryparentguid = w1.categoryguid)
> where cat.categoryparentguid IS NOT NULL
> {code}
> Results are:
> ||categoryid||categoryname||parentcategory||
> |id1|restaurants|null|
> |id1|restaurants|null|
> |id2|Coffee Shops|null|
> |id2|Coffee Shops|null|
> Physical plan is:
> {code:sql}
> 00-00    Screen : rowType = RecordType(VARCHAR(50) categoryguid, VARCHAR(50) categoryname,
VARCHAR(50) parentcat): rowcount = 100.0, cumulative cost = {110.0 rows, 110.0 cpu, 0.0 io,
0.0 network, 0.0 memory}, id = 64293
> 00-01      Project(categoryguid=[$0], categoryname=[$1], parentcat=[$2]) : rowType =
RecordType(VARCHAR(50) categoryguid, VARCHAR(50) categoryname, VARCHAR(50) parentcat): rowcount
= 100.0, cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id =
64292
> 00-02        Project(categoryguid=[$9], categoryname=[$41], parentcat=[$47]) : rowType
= RecordType(VARCHAR(50) categoryguid, VARCHAR(50) categoryname, VARCHAR(50) parentcat): rowcount
= 100.0, cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id =
64291
> 00-03          Jdbc(sql=[SELECT *
> FROM "public"."transactions"
> INNER JOIN (SELECT *
> FROM "public"."categories"
> WHERE "categoryparentguid" IS NOT NULL) AS "t" ON "transactions"."categoryguid" = "t"."categoryguid"
> INNER JOIN "public"."categories" AS "categories0" ON "t"."categoryparentguid" = "categories0"."categoryguid"])
: rowType = RecordType(VARCHAR(255) transactionguid, VARCHAR(255) relatedtransactionguid,
VARCHAR(255) transactioncode, DECIMAL(1, 0) transactionpending, VARCHAR(50) transactionrefobjecttype,
VARCHAR(255) transactionrefobjectguid, VARCHAR(1024) transactionrefobjectvalue, TIMESTAMP(6)
transactiondate, VARCHAR(256) transactiondescription, VARCHAR(50) categoryguid, VARCHAR(3)
transactioncurrency, DECIMAL(15, 3) transactionoldbalance, DECIMAL(13, 3) transactionamount,
DECIMAL(15, 3) transactionnewbalance, VARCHAR(512) transactionnotes, DECIMAL(2, 0) transactioninstrumenttype,
VARCHAR(20) transactioninstrumentsubtype, VARCHAR(20) transactioninstrumentcode, VARCHAR(50)
transactionorigpartyguid, VARCHAR(255) transactionorigaccountguid, VARCHAR(50) transactionrecpartyguid,
VARCHAR(255) transactionrecaccountguid, VARCHAR(256) transactionstatementdesc, DECIMAL(1,
0) transactionsplit, DECIMAL(1, 0) transactionduplicated, DECIMAL(1, 0) transactionrecategorized,
TIMESTAMP(6) transactioncreatedat, TIMESTAMP(6) transactionupdatedat, VARCHAR(50) transactionmatrulerefobjtype,
VARCHAR(50) transactionmatrulerefobjguid, VARCHAR(50) transactionmatrulerefobjvalue, VARCHAR(50)
transactionuserruleguid, DECIMAL(2, 0) transactionsplitorder, TIMESTAMP(6) transactionprocessedat,
TIMESTAMP(6) transactioncategoryassignat, VARCHAR(50) transactionsystemcategoryguid, VARCHAR(50)
transactionorigmandateid, VARCHAR(100) fingerprint, VARCHAR(50) categoryguid0, VARCHAR(50)
categoryparentguid, DECIMAL(3, 0) categorytype, VARCHAR(50) categoryname, VARCHAR(50) categorydescription,
VARCHAR(50) partyguid, VARCHAR(50) categoryguid1, VARCHAR(50) categoryparentguid0, DECIMAL(3,
0) categorytype0, VARCHAR(50) categoryname0, VARCHAR(50) categorydescription0, VARCHAR(50)
partyguid0): rowcount = 100.0, cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network,
0.0 memory}, id = 64259
> {code}
> I worked around it by creating a view on postgres but not ideal. Thanks in advance.
> First Drill Jira Bug.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message