drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5970) DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns instead of "REQUIRED" ones
Date Tue, 19 Dec 2017 00:35:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295925#comment-16295925
] 

Paul Rogers commented on DRILL-5970:
------------------------------------

The second issue is how Drill handles complex Parquet schemas:

{noformat}
optional group complex_field (MAP) {
    repeated group map (MAP_KEY_VALUE) {
      required binary key (UTF8);
      required int64 value;
    }
  }
{noformat}

This is "an optional array of maps of required scalars." That is, in JSON:

{code}
{a: [ {b: 10}, {b: 20} ] }
{a: null}
{code}

As explained, Drill implements this as an array of maps with required fields. Drill treats
the null arrays as empty arrays:

{code}
{a: [ {b: 10}, {b: 20} ] }
{a: [ ] }
{code}

This should work today. The question is, do the Parquet column writers implement this?

> DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns instead
of "REQUIRED" ones
> -------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5970
>                 URL: https://issues.apache.org/jira/browse/DRILL-5970
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Codegen, Execution - Data Types, Storage - Parquet
>    Affects Versions: 1.11.0
>            Reporter: Vitalii Diravka
>            Assignee: Vitalii Diravka
>
> The root cause of the issue is that adding REQUIRED (not-nullable) data types to the
container in the all MapWriters is not implemented.
> It can lead to get invalid schema. 
> {code}
> 0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.bof_repro_1 as select * from (select
CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (2.376 seconds)
> {code}
> Run from Drill unit test framework (to see "data mode"):
> {code}
> @Test
>   public void test() throws Exception {
>     setColumnWidths(new int[] {25, 25});
>     List<QueryDataBatch> queryDataBatches = testSqlWithResults("select * from dfs.tmp.bof_repro_1");
>     printResult(queryDataBatches);
>   }
> 1 row(s):
> -------------------------------------------------------
> | MYCOL<VARCHAR(REPEATED)> | Bucket<VARCHAR(OPTIONAL)>|
> -------------------------------------------------------
> | ["hello","hai"]          | Bucket1                  |
> -------------------------------------------------------
> Total record count: 1
> {code}
> {code}
> vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar
schema /tmp/bof_repro_1/0_0_0.parquet 
> message root {
>   repeated binary MYCOL (UTF8);
>   required binary Bucket (UTF8);
> }
> {code}
> To simulate of obtaining the wrong result you can try the query with aggregation by using
a new parquet reader (used by default for complex data types) and old parquet reader. False
"Hash aggregate does not support schema changes" error will happen. 
> 1) Create two parquet files.
> {code}
> 0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_1 as select * from (select
CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (1.122 seconds)
> 0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_2 as select * from (select
CONVERT_FROM('[]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (0.552 seconds)
> 0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
> {code}
> 2) Copy the parquet files from bof_repro_1 to bof_repro_2.
> {code}
> [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_1
> Found 1 items
> -rw-r--r--   3 mapr mapr        415 2017-07-25 11:46 /tmp/bof_repro_1/0_0_0.parquet
> [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_2
> Found 1 items
> -rw-r--r--   3 mapr mapr        368 2017-07-25 11:46 /tmp/bof_repro_2/0_0_0.parquet
> [root@naravm1 ~]# hadoop fs -cp /tmp/bof_repro_1/0_0_0.parquet /tmp/bof_repro_2/0_0_1.parquet
> [root@naravm1 ~]#
> {code}
> 3) Query the table.
> {code}
> 0: jdbc:drill:schema=dfs> ALTER SESSION SET  `planner.enable_streamagg`=false;
> +-------+------------------------------------+
> |  ok   |              summary               |
> +-------+------------------------------------+
> | true  | planner.enable_streamagg updated.  |
> +-------+------------------------------------+
> 1 row selected (0.124 seconds)
> 0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
> +------------------+----------+
> |      MYCOL       |  Bucket  |
> +------------------+----------+
> | ["hello","hai"]  | Bucket1  |
> | null             | Bucket1  |
> +------------------+----------+
> 2 rows selected (0.247 seconds)
> 0: jdbc:drill:schema=dfs> select bucket, count(*) from dfs.tmp.bof_repro_2 group by
bucket;
> Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema changes
> Fragment 0:0
> [Error Id: 60f8ada3-5f00-4413-a676-4881fc8cb409 on naravm3:31010] (state=,code=0)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message