drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5970) DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns instead of "REQUIRED" ones
Date Thu, 04 Jan 2018 16:18:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311581#comment-16311581
] 

Paul Rogers commented on DRILL-5970:
------------------------------------

The proposed change is practical. The issues raised here and in DRILL-6035 are more at the
design level. Looks like we don't have the luxury of thinking about the design. So, we might
as well change the code and see what users think.

> DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns instead
of "REQUIRED" ones
> -------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5970
>                 URL: https://issues.apache.org/jira/browse/DRILL-5970
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Codegen, Execution - Data Types, Storage - Parquet
>    Affects Versions: 1.11.0
>            Reporter: Vitalii Diravka
>            Assignee: Vitalii Diravka
>
> The root cause of the issue is that adding REQUIRED (non-nullable) data types to the
container in all MapWriters is not implemented.
> This can lead to an invalid schema.
> {code}
> 0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.bof_repro_1 as select * from (select
CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (2.376 seconds)
> {code}
> Run the query from the Drill unit test framework (to see the "data mode"):
> {code}
> @Test
>   public void test() throws Exception {
>     setColumnWidths(new int[] {25, 25});
>     List<QueryDataBatch> queryDataBatches = testSqlWithResults("select * from dfs.tmp.bof_repro_1");
>     printResult(queryDataBatches);
>   }
> 1 row(s):
> -------------------------------------------------------
> | MYCOL<VARCHAR(REPEATED)> | Bucket<VARCHAR(OPTIONAL)>|
> -------------------------------------------------------
> | ["hello","hai"]          | Bucket1                  |
> -------------------------------------------------------
> Total record count: 1
> {code}
> {code}
> vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar
schema /tmp/bof_repro_1/0_0_0.parquet 
> message root {
>   repeated binary MYCOL (UTF8);
>   required binary Bucket (UTF8);
> }
> {code}
> To reproduce the wrong result, run an aggregation query that involves both the
new parquet reader (used by default for complex data types) and the old parquet reader. A false
"Hash aggregate does not support schema changes" error will occur.
> 1) Create two parquet files.
> {code}
> 0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_1 as select * from (select
CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (1.122 seconds)
> 0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_2 as select * from (select
CONVERT_FROM('[]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (0.552 seconds)
> 0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
> {code}
> 2) Copy the parquet files from bof_repro_1 to bof_repro_2.
> {code}
> [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_1
> Found 1 items
> -rw-r--r--   3 mapr mapr        415 2017-07-25 11:46 /tmp/bof_repro_1/0_0_0.parquet
> [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_2
> Found 1 items
> -rw-r--r--   3 mapr mapr        368 2017-07-25 11:46 /tmp/bof_repro_2/0_0_0.parquet
> [root@naravm1 ~]# hadoop fs -cp /tmp/bof_repro_1/0_0_0.parquet /tmp/bof_repro_2/0_0_1.parquet
> [root@naravm1 ~]#
> {code}
> 3) Query the table.
> {code}
> 0: jdbc:drill:schema=dfs> ALTER SESSION SET  `planner.enable_streamagg`=false;
> +-------+------------------------------------+
> |  ok   |              summary               |
> +-------+------------------------------------+
> | true  | planner.enable_streamagg updated.  |
> +-------+------------------------------------+
> 1 row selected (0.124 seconds)
> 0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
> +------------------+----------+
> |      MYCOL       |  Bucket  |
> +------------------+----------+
> | ["hello","hai"]  | Bucket1  |
> | null             | Bucket1  |
> +------------------+----------+
> 2 rows selected (0.247 seconds)
> 0: jdbc:drill:schema=dfs> select bucket, count(*) from dfs.tmp.bof_repro_2 group by
bucket;
> Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema changes
> Fragment 0:0
> [Error Id: 60f8ada3-5f00-4413-a676-4881fc8cb409 on naravm3:31010] (state=,code=0)
> {code}
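
For illustration only, here is a minimal Java sketch of why the mismatch described above surfaces as a schema change. It is not Drill code: SchemaModeSketch, DataMode, Column, fromParquetRepetition and isSchemaChange are hypothetical names made up for this example. It also assumes that one reader honors the "required" repetition declared in the file while the other always reports OPTIONAL, as the issue title describes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class SchemaModeSketch {
  // Hypothetical stand-in for a column's data mode; not Drill's own enum.
  enum DataMode { REQUIRED, OPTIONAL, REPEATED }

  // Hypothetical column descriptor: name, type and data mode.
  static final class Column {
    final String name;
    final String type;
    final DataMode mode;

    Column(String name, String type, DataMode mode) {
      this.name = name;
      this.type = type;
      this.mode = mode;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Column)) {
        return false;
      }
      Column c = (Column) o;
      return name.equals(c.name) && type.equals(c.type) && mode == c.mode;
    }

    @Override
    public int hashCode() {
      return Objects.hash(name, type, mode);
    }

    @Override
    public String toString() {
      return name + "<" + type + "(" + mode + ")>";
    }
  }

  // Maps the repetition keyword printed by parquet-tools to a data mode.
  static DataMode fromParquetRepetition(String repetition) {
    switch (repetition) {
      case "required": return DataMode.REQUIRED;
      case "optional": return DataMode.OPTIONAL;
      case "repeated": return DataMode.REPEATED;
      default: throw new IllegalArgumentException("unknown repetition: " + repetition);
    }
  }

  // Two batches count as a schema change when their column definitions differ in any part.
  static boolean isSchemaChange(List<Column> previous, List<Column> next) {
    return !previous.equals(next);
  }

  public static void main(String[] args) {
    // Assumption: this reader honors the file schema ("required binary Bucket").
    List<Column> honorsFileSchema = Arrays.asList(
        new Column("MYCOL", "VARCHAR", fromParquetRepetition("repeated")),
        new Column("Bucket", "VARCHAR", fromParquetRepetition("required")));
    // Assumption: this reader always reports scalar columns as OPTIONAL.
    List<Column> alwaysOptional = Arrays.asList(
        new Column("MYCOL", "VARCHAR", DataMode.REPEATED),
        new Column("Bucket", "VARCHAR", DataMode.OPTIONAL));

    System.out.println(honorsFileSchema);
    System.out.println(alwaysOptional);
    System.out.println("schema change: " + isSchemaChange(honorsFileSchema, alwaysOptional));
  }
}
```

The two column lists differ only in the data mode of Bucket, yet a strict equality check treats them as different schemas. That is the kind of mismatch an operator unable to handle schema changes would reject.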



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
