drill-issues mailing list archives

From "Vitalii Diravka (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-4614) Drill must appoint one data type per one column for self-describing data while querying directories
Date Mon, 18 Apr 2016 16:50:26 GMT
Vitalii Diravka created DRILL-4614:
--------------------------------------

             Summary: Drill must appoint one data type per one column for self-describing data while querying directories
                 Key: DRILL-4614
                 URL: https://issues.apache.org/jira/browse/DRILL-4614
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Data Types
    Affects Versions: 1.6.0
            Reporter: Vitalii Diravka
            Assignee: Vitalii Diravka
             Fix For: 1.7.0


When Drill selects data from a directory and detects data types on the fly,
a single field can end up with several data types.

For example:

1. Create an input file as follows.
20K rows with the following:
{"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
200 rows with the following:
{"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}
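
The input file from step 1 could be generated with a short script along these lines. This is a minimal sketch: the helper names build_rows and write_input are hypothetical, and writing one JSON object per line with the 200 extended rows last is an assumption (the report only gives the row counts and contents).

```python
import json

# Row shape taken from the report; present in the first 20K rows.
BASE_ROW = {
    "some": "yes",
    "others": {"other": "true", "all": "false", "sometimes": "yes"},
}


def build_rows(n_base=20_000, n_extended=200):
    """Yield n_base plain rows, then n_extended rows that also carry others.additional."""
    extended = {
        "some": "yes",
        "others": dict(BASE_ROW["others"], additional="last entries only"),
    }
    for _ in range(n_base):
        yield BASE_ROW
    for _ in range(n_extended):
        yield extended


def write_input(path, n_base=20_000, n_extended=200):
    """Write the rows to path, one compact JSON object per line (assumed layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in build_rows(n_base, n_extended):
            f.write(json.dumps(row, separators=(",", ":")) + "\n")
    return path
```

Calling write_input("data.json") would produce a file matching the name used in the CTAS statement of step 2.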

2. CTAS as follows
{code:sql}
CREATE TABLE dfs.`tmp`.`tp` AS SELECT * FROM dfs.`data.json` t
{code}

In this case the Parquet table is created as a folder containing two files.

3. Select the data
{code:sql}
SELECT t.others.additional FROM dfs.`tmp`.`tp` t
{code}
The result set is a mix of EXPR$0<INT(OPTIONAL)> and EXPR$0<VARCHAR(OPTIONAL)>.

This happens because Drill determines a column's data type per file.
The same result occurs with JSON files.
Since the streaming aggregate does not support schema changes, this issue makes it impossible
to use aggregate functions on the query results.
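
The per-file behaviour can be illustrated with a toy model. This is not Drill's implementation: the function names are hypothetical, and the fallback to INT(OPTIONAL) for a column that a file never contains is an assumption inferred from the result reported in step 3.

```python
def infer_type(records, path):
    """Toy per-file inference: the type of a dotted path within ONE file.

    Returns VARCHAR(OPTIONAL) if any record holds a string at the path,
    otherwise falls back to INT(OPTIONAL) (assumed default for a missing column).
    """
    keys = path.split(".")
    for rec in records:
        node = rec
        for key in keys:
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        if isinstance(node, str):
            return "VARCHAR(OPTIONAL)"
    return "INT(OPTIONAL)"


def column_types_per_file(files, path):
    """Infer the column type independently for each file, as the report describes."""
    return [infer_type(records, path) for records in files]


# Two "files": the first never contains others.additional, the second does.
file_a = [{"some": "yes", "others": {"other": "true"}}]
file_b = [{"some": "yes", "others": {"other": "true", "additional": "last entries only"}}]
types = column_types_per_file([file_a, file_b], "others.additional")
```

Here types holds one entry per file and the two entries differ, which is the schema change that a downstream aggregate would encounter.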



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
