drill-dev mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5551) `columns` changes meaning for CSV files depending on query
Date Mon, 29 May 2017 22:29:04 GMT
Paul Rogers created DRILL-5551:
----------------------------------

             Summary: `columns` changes meaning for CSV files depending on query
                 Key: DRILL-5551
                 URL: https://issues.apache.org/jira/browse/DRILL-5551
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.10.0
            Reporter: Paul Rogers
            Priority: Minor


Drill's CSV column reader supports two forms of files:

* Files with column headers as the first line of the file.
* Files without column headers.

The text format configuration within a storage plugin config specifies which of the two forms to
expect for all files accessed via that config.

Suppose we have a CSV file with headers:

{code}
a,b,c
10,foo,bar
{code}

Suppose we configure a storage plugin to use headers:

{code}
    TextFormatConfig csvFormat = new TextFormatConfig();
    csvFormat.fieldDelimiter = ',';
    csvFormat.skipFirstLine = false;  // do not discard the first line; it holds the headers
    csvFormat.extractHeader = true;   // take column names from the first line
{code}

(The above can also be done using JSON when running Drill as a server.)

Suppose we execute this query:
{code}
SELECT columns FROM dfs.data.`example.csv`
{code}

The result is a single column, the special {{columns}} array, that contains all three fields.

Suppose we alter the query just a bit:
{code}
SELECT columns, a FROM dfs.data.`example.csv`
{code}

Now the result set is two non-nullable Varchar columns:

{code}
columns,a
,10
{code}

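It seems that the meaning of {{columns}} shifts depending on whether it appears by itself in the
SELECT list or alongside other columns. Alone, it is the special array of all fields; with other
columns, it is apparently treated as an ordinary (and here nonexistent) column named "columns".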
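
The observed behavior can be summarized with a toy model. This is only a sketch of what the two queries above appear to do, not Drill's actual reader code; the class {{ColumnsProjectionSketch}} and its {{project}} method are hypothetical names invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the behavior observed in this report. NOT Drill's code;
// the class and method names are hypothetical.
class ColumnsProjectionSketch {

  // Projects one data row, given the header names and the SELECT list.
  static Map<String, Object> project(
      List<String> header, List<String> row, List<String> selectList) {
    Map<String, Object> result = new LinkedHashMap<>();
    if (selectList.equals(List.of("columns"))) {
      // Alone in the SELECT list: `columns` is the special array of all fields.
      result.put("columns", new ArrayList<>(row));
      return result;
    }
    for (String name : selectList) {
      // Otherwise every name, including `columns`, is looked up in the header.
      // A name with no matching header yields an empty Varchar.
      int index = header.indexOf(name);
      result.put(name, index >= 0 ? row.get(index) : "");
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> header = Arrays.asList("a", "b", "c");
    List<String> row = Arrays.asList("10", "foo", "bar");
    System.out.println(project(header, row, List.of("columns")));       // array of all fields
    System.out.println(project(header, row, List.of("columns", "a")));  // empty `columns`, a=10
  }
}
```

Under this model the same name resolves through two different rules, which is the inconsistency this issue describes.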

Perhaps this handles the case of a file such as:

{code}
columns,values
a;b,10;10
c;d,20;30
{code}

That is fine, but what if I wanted just the first column:

{code}
SELECT columns FROM dfs.data.`strange.csv`
{code}

How would the code know if {{columns}} was the special column vs. the normal column called
"columns"?
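
To make the ambiguous case concrete, here is a minimal sketch of a check that flags it: a query is ambiguous when the file header itself contains a field named "columns" and the SELECT list references that name. The class {{ColumnsAmbiguityCheck}} is hypothetical, and the case-insensitive comparison is an assumption made for this sketch rather than something stated in this report.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Drill: detects when `columns` in a query
// could mean either the special array or a header field of the same name.
class ColumnsAmbiguityCheck {

  static final String SPECIAL_NAME = "columns";

  // Assumption: names are compared without regard to case.
  static boolean containsName(List<String> names, String target) {
    for (String name : names) {
      if (name.equalsIgnoreCase(target)) {
        return true;
      }
    }
    return false;
  }

  // True when the header defines a "columns" field and the query selects it.
  static boolean isAmbiguous(List<String> header, List<String> selectList) {
    return containsName(header, SPECIAL_NAME) && containsName(selectList, SPECIAL_NAME);
  }

  public static void main(String[] args) {
    List<String> strangeHeader = Arrays.asList("columns", "values");
    List<String> exampleHeader = Arrays.asList("a", "b", "c");
    List<String> query = Arrays.asList("columns");
    System.out.println(isAmbiguous(strangeHeader, query)); // true
    System.out.println(isAmbiguous(exampleHeader, query)); // false
  }
}
```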

Perhaps one long-term solution is to make {{columns}} into a table function (as has been proposed
for the implicit columns):

{code}
SELECT columns(t) FROM dfs.data.`strange.csv` AS t
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
