drill-dev mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5551) `columns` changes meaning for CSV files depending on query
Date Mon, 29 May 2017 22:29:04 GMT
Paul Rogers created DRILL-5551:

             Summary: `columns` changes meaning for CSV files depending on query
                 Key: DRILL-5551
                 URL: https://issues.apache.org/jira/browse/DRILL-5551
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.10.0
            Reporter: Paul Rogers
            Priority: Minor

Drill's CSV column reader supports two forms of files:

* Files with column headers as the first line of the file.
* Files without column headers.

The CSV storage plugin specifies which format to use for files accessed via that storage plugin.

Suppose we have a CSV file with headers:


Suppose we configure a storage plugin to use headers:

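    TextFormatConfig csvFormat = new TextFormatConfig();
    csvFormat.fieldDelimiter = ',';   // fields are comma-separated
    csvFormat.skipFirstLine = false;  // do not discard the first line
    csvFormat.extractHeader = true;   // read the first line as column names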

(The above can also be done using JSON when running Drill as a server.)
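
For reference, a minimal sketch of what the equivalent entry might look like in the "formats" section of a storage plugin configuration. The key names follow Drill's documented text format options; the entry name "csv" and the "extensions" list are assumptions for illustration and are not taken from this report:

```json
"csv": {
  "type": "text",
  "extensions": ["csv"],
  "delimiter": ",",
  "skipFirstLine": false,
  "extractHeader": true
}
```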

Suppose we execute this query:

    SELECT columns FROM `dfs.data.example.csv`

The result is a single column, the special {{columns}} array, that contains all three fields.

Suppose we alter the query just a bit:

    SELECT columns, a FROM `dfs.data.example.csv`

Now the result set is two non-nullable Varchar columns:


It seems that the meaning of `columns` shifts depending on whether the name appears by itself
or alongside other columns in the SELECT list.
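
To make the observation concrete, here is a toy Java model of the behavior described above. It only restates what this report observes and is not Drill's planner or reader code; the class `ColumnsMeaning`, the enum `Kind`, and the method `meaningOf` are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the behavior observed in this report; NOT Drill code.
// All names here are hypothetical.
final class ColumnsMeaning {

    enum Kind {
        ALL_FIELDS_ARRAY, // the special array holding every field of the row
        VARCHAR_COLUMN    // an ordinary non-nullable Varchar column
    }

    // Returns how `columns` appears to be interpreted for a given SELECT list
    // when the plugin is configured with extractHeader = true.
    static Kind meaningOf(List<String> selectList) {
        if (!selectList.contains("columns")) {
            throw new IllegalArgumentException("SELECT list does not mention columns");
        }
        // Observed: on its own, `columns` is the special array; next to other
        // names, it comes back as a plain Varchar column.
        return selectList.size() == 1 ? Kind.ALL_FIELDS_ARRAY : Kind.VARCHAR_COLUMN;
    }

    public static void main(String[] args) {
        System.out.println(meaningOf(Arrays.asList("columns")));
        System.out.println(meaningOf(Arrays.asList("columns", "a")));
    }
}
```

The point of the sketch is that the interpretation depends on the size of the SELECT list rather than on anything in the file itself, which is what makes the behavior surprising.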

Perhaps this handles the case of a file such as:


That is fine. But what if I just wanted the first column:

    SELECT columns FROM `dfs.data.strange.csv`

How would the code know if {{columns}} was the special column vs. the normal column called

Perhaps one long-term solution is to make {{columns}} into a table function (as has been proposed
for the implicit columns):

    SELECT columns(t) FROM `dfs.data.strange.csv` AS t

This message was sent by Atlassian JIRA
