drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5550) SELECT non-existent column produces empty required VARCHAR
Date Thu, 14 Sep 2017 01:51:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165613#comment-16165613

Paul Rogers commented on DRILL-5550:

I suspect my original comment may have been a bit optimistic in assuming we can fix this issue.
Later discussion revealed additional considerations.

First, CSV does not support nulls. So, if the missing column does exist in at least one CSV
file, it will be non-null in that file. The closes CSV can come to a null column is a blank
column. So, the original code was probably right in saying that a missing column can be assumed
to be a blank Varchar.

Second, assume the query scans two files, one with (a, c) the other with (a, c, d). When reading
d from the second file, it will be a (required) Varchar. To avoid a schema change, d should
also be a required Varchar when reading the first file where d is missing.

As noted in the description, it would be best if Drill had the concept of a missing column.
But, Drill does not. So, we have to predict, in the first file, where d is missing, the type
of d in some later file. Per the above paragraph, we *can* predict that, if d ever does appear
in a CSV file, it will be a required Varchar.

This trick does not work, however, for JSON or Parquet or any file format that has more than
one type.

Since we cannot come up with a complete, working design for missing columns until we address
the missing column issue, perhaps it is better to just let this issue alone for now.

> SELECT non-existent column produces empty required VARCHAR
> ----------------------------------------------------------
>                 Key: DRILL-5550
>                 URL: https://issues.apache.org/jira/browse/DRILL-5550
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Priority: Minor
> Drill's CSV column reader supports two forms of files:
> * Files with column headers as the first line of the file.
> * Files without column headers.
> The CSV storage plugin specifies which format to use for files accessed via that storage
plugin config.
> Suppose we have a CSV file with headers:
> {code}
> a,b,c
> 10,foo,bar
> {code}
> Suppose we configure a storage plugin to use headers:
> {code}
>     TextFormatConfig csvFormat = new TextFormatConfig();
>     csvFormat.fieldDelimiter = ',';
>     csvFormat.skipFirstLine = false;
>     csvFormat.extractHeader = true;
> {code}
> (The above can also be done using JSON when running Drill as a server.)
> Execute the following query:
> {code}
> SELECT a, c, d FROM `dfs.data.example.csv`
> {code}
> Results:
> {code}
> a,c,d
> 10,bar,
> {code}
> The actual type of column {{d}} is non-nullable VARCHAR.
> This is inconsistent with other parts of Drill in two ways, one may be a bug. Most other
parts of Drill use a nullable INT for "missing" columns.
> 1. For CSV it makes sense for the data type to be VARCHAR, since all CSV columns are
of that type.
> 2. It may *not* make sense for the column to be non-nullable and blank rather than nullable
and NULL. In SQL, NULL means that the data is unknown, which is the case here.
> In the future, we may want to use some other indication for a missing column. Until then,
the requested change is to make the type of a missing CSV column a nullable VARCHAR set to
value NULL.

This message was sent by Atlassian JIRA

View raw message