Date: Tue, 9 Jan 2018 14:28:01 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: issues@drill.apache.org
Subject: [jira] [Commented] (DRILL-5970) DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns instead of "REQUIRED" ones

    [ https://issues.apache.org/jira/browse/DRILL-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318489#comment-16318489 ]

ASF GitHub Bot commented on DRILL-5970:
---------------------------------------

Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1047#discussion_r160419416

    --- Diff: exec/vector/src/main/codegen/templates/BaseWriter.java ---
    @@ -106,37 +114,37 @@ MapOrListWriter list(String name);
       boolean isMapWriter();
       boolean isListWriter();
    -  UInt1Writer uInt1(String name);
    -  UInt2Writer uInt2(String name);
    -  UInt4Writer uInt4(String name);
    -  UInt8Writer uInt8(String name);
    -  VarCharWriter varChar(String name);
    -  Var16CharWriter var16Char(String name);
    -  TinyIntWriter tinyInt(String name);
    -  SmallIntWriter smallInt(String name);
    -  IntWriter integer(String name);
    -  BigIntWriter bigInt(String name);
    -  Float4Writer float4(String name);
    -  Float8Writer float8(String name);
    -  BitWriter bit(String name);
    -  VarBinaryWriter varBinary(String name);
    +  UInt1Writer uInt1(String name, TypeProtos.DataMode dataMode);
    --- End diff --

    > The mode passed here cannot be REPEATED: just won't work. So, can we pass REQUIRED?

    Yes, we can with these improvements: for the OPTIONAL dataMode we will use nullable vectors, and regular ones for REQUIRED. The proposed alternative is interesting; I will think about the best way to implement it. Perhaps SingleMapWriter could be replaced with an OptionalMapWriter and a RequiredMapWriter.

    Note: some good code refactoring was done here, so this PR shouldn't be closed.
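
A minimal sketch of the dispatch described in the comment above, i.e. nullable vectors for OPTIONAL and regular ones for REQUIRED. TypeProtos.DataMode, UInt1Vector and NullableUInt1Vector are real Drill classes; the selector class itself is hypothetical and is not code from the PR:

{code}
import org.apache.drill.common.types.TypeProtos.DataMode;
import org.apache.drill.exec.vector.NullableUInt1Vector;
import org.apache.drill.exec.vector.UInt1Vector;

// Hypothetical selector: picks the value-vector class backing a UInt1 writer
// from the requested data mode, mirroring the behavior described above.
final class UInt1VectorSelector {
  static Class<?> vectorClassFor(DataMode dataMode) {
    switch (dataMode) {
      case OPTIONAL:
        return NullableUInt1Vector.class;  // nullable vector for OPTIONAL
      case REQUIRED:
        return UInt1Vector.class;          // regular vector for REQUIRED
      default:
        // REPEATED is handled by list writers, not by this method.
        throw new UnsupportedOperationException("Unsupported mode: " + dataMode);
    }
  }
}
{code}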

> DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns instead of "REQUIRED" ones
> --------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5970
>                 URL: https://issues.apache.org/jira/browse/DRILL-5970
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Codegen, Execution - Data Types, Storage - Parquet
>    Affects Versions: 1.11.0
>            Reporter: Vitalii Diravka
>            Assignee: Vitalii Diravka
>
> The root cause of the issue is that adding REQUIRED (non-nullable) data types to the container is not implemented in any of the MapWriters. This can lead to an invalid schema.
> {code}
> 0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.bof_repro_1 as select * from (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (2.376 seconds)
> {code}
> Run from the Drill unit test framework (to see the data mode):
> {code}
> @Test
> public void test() throws Exception {
>   setColumnWidths(new int[] {25, 25});
>   List<QueryDataBatch> queryDataBatches = testSqlWithResults("select * from dfs.tmp.bof_repro_1");
>   printResult(queryDataBatches);
> }
>
> 1 row(s):
> -------------------------------------------------------
> | MYCOL                     | Bucket                  |
> -------------------------------------------------------
> | ["hello","hai"]           | Bucket1                 |
> -------------------------------------------------------
> Total record count: 1
> {code}
> {code}
> vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar schema /tmp/bof_repro_1/0_0_0.parquet
> message root {
>   repeated binary MYCOL (UTF8);
>   required binary Bucket (UTF8);
> }
> {code}
> To reproduce the wrong result, run an aggregation query that mixes the new Parquet reader (used by default for complex data types) with the old one. A false "Hash aggregate does not support schema changes" error will happen; the sketch below illustrates why, and the numbered steps after it reproduce the error.
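> A minimal sketch of the underlying mismatch (illustrative only, not code from the issue; TypeProtos and its protobuf builder API are real Drill classes): the same column materialized with different data modes produces unequal MajorTypes, which downstream operators report as a schema change.
> {code}
> import org.apache.drill.common.types.TypeProtos.DataMode;
> import org.apache.drill.common.types.TypeProtos.MajorType;
> import org.apache.drill.common.types.TypeProtos.MinorType;
>
> public class SchemaMismatchSketch {
>   public static void main(String[] args) {
>     // The Bucket column as declared in the Parquet file: required VARCHAR.
>     MajorType inFile = MajorType.newBuilder()
>         .setMinorType(MinorType.VARCHAR)
>         .setMode(DataMode.REQUIRED)
>         .build();
>     // The same column as DrillParquetReader materializes it: always optional.
>     MajorType materialized = MajorType.newBuilder()
>         .setMinorType(MinorType.VARCHAR)
>         .setMode(DataMode.OPTIONAL)
>         .build();
>     // Unequal types between batches surface as a "schema change" error.
>     System.out.println(inFile.equals(materialized)); // prints: false
>   }
> }
> {code}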
> 1) Create two parquet files.
> {code}
> 0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_1 as select * from (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (1.122 seconds)
> 0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_2 as select * from (select CONVERT_FROM('[]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (0.552 seconds)
> {code}
> 2) Copy the parquet files from bof_repro_1 to bof_repro_2.
> {code}
> [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_1
> Found 1 items
> -rw-r--r--   3 mapr mapr        415 2017-07-25 11:46 /tmp/bof_repro_1/0_0_0.parquet
> [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_2
> Found 1 items
> -rw-r--r--   3 mapr mapr        368 2017-07-25 11:46 /tmp/bof_repro_2/0_0_0.parquet
> [root@naravm1 ~]# hadoop fs -cp /tmp/bof_repro_1/0_0_0.parquet /tmp/bof_repro_2/0_0_1.parquet
> {code}
> 3) Query the table.
> {code}
> 0: jdbc:drill:schema=dfs> ALTER SESSION SET `planner.enable_streamagg`=false;
> +-------+------------------------------------+
> |  ok   |              summary               |
> +-------+------------------------------------+
> | true  | planner.enable_streamagg updated.  |
> +-------+------------------------------------+
> 1 row selected (0.124 seconds)
> 0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
> +------------------+----------+
> |      MYCOL       |  Bucket  |
> +------------------+----------+
> | ["hello","hai"]  | Bucket1  |
> | null             | Bucket1  |
> +------------------+----------+
> 2 rows selected (0.247 seconds)
> 0: jdbc:drill:schema=dfs> select bucket, count(*) from dfs.tmp.bof_repro_2 group by bucket;
> Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema changes
> Fragment 0:0
> [Error Id: 60f8ada3-5f00-4413-a676-4881fc8cb409 on naravm3:31010] (state=,code=0)
> {code}
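> For a programmatic cross-check of the copy in step 2, a short sketch (assuming parquet-mr 1.7+ package names and the repro paths above; this helper is not part of the issue) that reads both files' footers and prints their schemas, mirroring the parquet-tools output:
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.hadoop.ParquetFileReader;
> import org.apache.parquet.hadoop.metadata.ParquetMetadata;
>
> public class PrintRepro2Schemas {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     // Both files live in bof_repro_2 after the copy in step 2; their
>     // printed schemas should disagree, which triggers the false error.
>     for (String file : new String[] {"/tmp/bof_repro_2/0_0_0.parquet",
>                                      "/tmp/bof_repro_2/0_0_1.parquet"}) {
>       ParquetMetadata footer = ParquetFileReader.readFooter(conf, new Path(file));
>       System.out.println(file + ":\n" + footer.getFileMetaData().getSchema());
>     }
>   }
> }
> {code}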