drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5292) Better Parquet handling of sparse columns
Date Fri, 24 Feb 2017 17:39:44 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883165#comment-15883165 ]

Paul Rogers commented on DRILL-5292:

This is another instance of a general problem in Drill: we do not have support for a
"Null" data type. In Drill, every null must be a null of some type. By default, when Drill
does not know the type, it chooses Int. This works out only when the data eventually turns
out to actually be integer. Otherwise, conflicts occur.

The same issue arises in JSON: one might have a long series of null values followed by a
non-null. In JSON, null is its own type: not "null integer" or "null string", just "null."
Again, Drill has to have "null of some type," so we guess integer, which may or may not be
correct.

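To make the failure mode concrete, here is a minimal sketch (not Drill's actual code) of a naive reader that, like Drill, assigns a concrete type to null values up front; the `guess_type` helper and the sample records are hypothetical:

```python
import json

# Hypothetical sample: column "a" is null for a while, then turns out to be a string.
records = [json.loads(s) for s in ['{"a": null}', '{"a": null}', '{"a": "hello"}']]

def guess_type(value):
    # Nulls are guessed as "int", mirroring Drill's default of Int for unknown types.
    if value is None:
        return "int"
    return type(value).__name__

types_seen = {guess_type(r["a"]) for r in records}
# The early "int" guess now conflicts with the real type once data arrives.
print(types_seen)
```

The conflict surfaces only after the guess has already been committed, which is exactly the problem described above.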
Then, we need type conversion rules. A "Null vector" would be compatible with any other type.
So, a vector of nulls can morph into a vector of strings or a vector of doubles once we see
the first non-null value.

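A minimal sketch of that promotion rule, assuming a hypothetical `NullableVector` class (Drill's real vectors work quite differently): the column starts untyped and adopts the type of the first non-null value it sees.

```python
class NullableVector:
    """Hypothetical column vector that starts as 'null type' and morphs
    into a concrete type on the first non-null value."""

    def __init__(self):
        self.type = None      # unknown until a non-null value arrives
        self.values = []

    def append(self, value):
        if value is None:
            # Nulls are always compatible, whatever the eventual type.
            self.values.append(None)
            return
        if self.type is None:
            # First non-null value decides the column's type.
            self.type = type(value)
        elif not isinstance(value, self.type):
            raise TypeError(
                f"cannot mix {type(value).__name__} with {self.type.__name__}")
        self.values.append(value)

v = NullableVector()
for item in [None, None, "x"]:
    v.append(item)
# The leading nulls are retroactively "nulls of string type".
print(v.type, v.values)
```

Under this rule the leading nulls never commit the column to Int, so no conflict occurs when the real type finally shows up.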
Such a solution still does not help the client, however. A client such as Tableau needs the
schema immediately. In this case for Parquet, or the suggested case for JSON, we don't know
the types until we read some amount of data. But by then, Drill has already had to predict
the future and tell the client what the type will eventually be. Since prediction is hard,
there is no good solution. Many workarounds have been proposed; this is another good suggestion.

> Better Parquet handling of sparse columns
> -----------------------------------------
>                 Key: DRILL-5292
>                 URL: https://issues.apache.org/jira/browse/DRILL-5292
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.10.0
>            Reporter: Nate Putnam
> It appears the current implementation of ParquetRecordReader will fill in missing columns
> between files as a NullableIntVector. It would be better if the code could determine whether
> that column was defined in a different file (and didn't conflict) and use the defined data type.


This message was sent by Atlassian JIRA
