Date: Tue, 9 Jan 2018 14:28:01 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: issues@drill.apache.org
Subject: [jira] [Commented] (DRILL-5970) DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns instead of "REQUIRED" ones

    [ https://issues.apache.org/jira/browse/DRILL-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318489#comment-16318489 ]

ASF GitHub Bot commented on DRILL-5970:
---------------------------------------

Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1047#discussion_r160419416

    --- Diff: exec/vector/src/main/codegen/templates/BaseWriter.java ---
    @@ -106,37 +114,37 @@ MapOrListWriter list(String name);
       boolean isMapWriter();
       boolean isListWriter();
    -  UInt1Writer uInt1(String name);
    -  UInt2Writer uInt2(String name);
    -  UInt4Writer uInt4(String name);
    -  UInt8Writer uInt8(String name);
    -  VarCharWriter varChar(String name);
    -  Var16CharWriter var16Char(String name);
    -  TinyIntWriter tinyInt(String name);
    -  SmallIntWriter smallInt(String name);
    -  IntWriter integer(String name);
    -  BigIntWriter bigInt(String name);
    -  Float4Writer float4(String name);
    -  Float8Writer float8(String name);
    -  BitWriter bit(String name);
    -  VarBinaryWriter varBinary(String name);
    +  UInt1Writer uInt1(String name, TypeProtos.DataMode dataMode);
    --- End diff --

    > The mode passed here cannot be REPEATED: just won't work. So, can we pass REQUIRED?

    Yes, we can with these improvements: for the OPTIONAL dataMode we will use nullable vectors, and regular ones for REQUIRED. The proposed alternative is interesting; I will think about the best way to implement it. Perhaps SingleMapWriter could be replaced with an OptionalMapWriter and a RequiredMapWriter.

    Note: some good code refactoring was done here, so this PR shouldn't be closed.
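
A minimal sketch of the dispatch described in the comment above, i.e. nullable vectors for OPTIONAL and regular ones for REQUIRED. TypeProtos.DataMode, UInt1Vector and NullableUInt1Vector are real Drill classes; the selector class itself is hypothetical and is not code from the PR:

{code}
import org.apache.drill.common.types.TypeProtos.DataMode;
import org.apache.drill.exec.vector.NullableUInt1Vector;
import org.apache.drill.exec.vector.UInt1Vector;

// Hypothetical selector: picks the value-vector class backing a UInt1 writer
// from the requested data mode, mirroring the behavior described above.
final class UInt1VectorSelector {
  static Class<?> vectorClassFor(DataMode dataMode) {
    switch (dataMode) {
      case OPTIONAL:
        return NullableUInt1Vector.class;  // nullable vector for OPTIONAL
      case REQUIRED:
        return UInt1Vector.class;          // regular vector for REQUIRED
      default:
        // REPEATED is handled by list writers, not by this method.
        throw new UnsupportedOperationException("Unsupported mode: " + dataMode);
    }
  }
}
{code}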

> DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns instead of "REQUIRED" ones
> --------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5970
>                 URL: https://issues.apache.org/jira/browse/DRILL-5970
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Codegen, Execution - Data Types, Storage - Parquet
>    Affects Versions: 1.11.0
>            Reporter: Vitalii Diravka
>            Assignee: Vitalii Diravka
>
> The root cause of the issue is that adding REQUIRED (non-nullable) data types to the container is not implemented in any of the MapWriters. This can lead to an invalid schema.
> {code}
> 0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.bof_repro_1 as select * from (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (2.376 seconds)
> {code}
> Run from the Drill unit test framework (to see the data mode):
> {code}
> @Test
> public void test() throws Exception {
>   setColumnWidths(new int[] {25, 25});
>   List<QueryDataBatch> queryDataBatches = testSqlWithResults("select * from dfs.tmp.bof_repro_1");
>   printResult(queryDataBatches);
> }
>
> 1 row(s):
> -------------------------------------------------------
> | MYCOL                     | Bucket                  |
> -------------------------------------------------------
> | ["hello","hai"]           | Bucket1                 |
> -------------------------------------------------------
> Total record count: 1
> {code}
> {code}
> vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar schema /tmp/bof_repro_1/0_0_0.parquet
> message root {
>   repeated binary MYCOL (UTF8);
>   required binary Bucket (UTF8);
> }
> {code}
> To reproduce the wrong result, run an aggregation query that mixes the new Parquet reader (used by default for complex data types) with the old one. A false "Hash aggregate does not support schema changes" error will happen; the sketch below illustrates why, and the numbered steps after it reproduce the error.
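> A minimal sketch of the underlying mismatch (illustrative only, not code from the issue; TypeProtos and its protobuf builder API are real Drill classes): the same column materialized with different data modes produces unequal MajorTypes, which downstream operators report as a schema change.
> {code}
> import org.apache.drill.common.types.TypeProtos.DataMode;
> import org.apache.drill.common.types.TypeProtos.MajorType;
> import org.apache.drill.common.types.TypeProtos.MinorType;
>
> public class SchemaMismatchSketch {
>   public static void main(String[] args) {
>     // The Bucket column as declared in the Parquet file: required VARCHAR.
>     MajorType inFile = MajorType.newBuilder()
>         .setMinorType(MinorType.VARCHAR)
>         .setMode(DataMode.REQUIRED)
>         .build();
>     // The same column as DrillParquetReader materializes it: always optional.
>     MajorType materialized = MajorType.newBuilder()
>         .setMinorType(MinorType.VARCHAR)
>         .setMode(DataMode.OPTIONAL)
>         .build();
>     // Unequal types between batches surface as a "schema change" error.
>     System.out.println(inFile.equals(materialized)); // prints: false
>   }
> }
> {code}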
> 1) Create two parquet files.
> {code}
> 0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_1 as select * from (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (1.122 seconds)
> 0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_2 as select * from (select CONVERT_FROM('[]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (0.552 seconds)
> {code}
> 2) Copy the parquet files from bof_repro_1 to bof_repro_2.
> {code}
> [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_1
> Found 1 items
> -rw-r--r--   3 mapr mapr        415 2017-07-25 11:46 /tmp/bof_repro_1/0_0_0.parquet
> [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_2
> Found 1 items
> -rw-r--r--   3 mapr mapr        368 2017-07-25 11:46 /tmp/bof_repro_2/0_0_0.parquet
> [root@naravm1 ~]# hadoop fs -cp /tmp/bof_repro_1/0_0_0.parquet /tmp/bof_repro_2/0_0_1.parquet
> {code}
> 3) Query the table.
> {code}
> 0: jdbc:drill:schema=dfs> ALTER SESSION SET `planner.enable_streamagg`=false;
> +-------+------------------------------------+
> |  ok   |              summary               |
> +-------+------------------------------------+
> | true  | planner.enable_streamagg updated.  |
> +-------+------------------------------------+
> 1 row selected (0.124 seconds)
> 0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
> +------------------+----------+
> |      MYCOL       |  Bucket  |
> +------------------+----------+
> | ["hello","hai"]  | Bucket1  |
> | null             | Bucket1  |
> +------------------+----------+
> 2 rows selected (0.247 seconds)
> 0: jdbc:drill:schema=dfs> select bucket, count(*) from dfs.tmp.bof_repro_2 group by bucket;
> Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema changes
> Fragment 0:0
> [Error Id: 60f8ada3-5f00-4413-a676-4881fc8cb409 on naravm3:31010] (state=,code=0)
> {code}
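> For a programmatic cross-check of the copy in step 2, a short sketch (assuming parquet-mr 1.7+ package names and the repro paths above; this helper is not part of the issue) that reads both files' footers and prints their schemas, mirroring the parquet-tools output:
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.hadoop.ParquetFileReader;
> import org.apache.parquet.hadoop.metadata.ParquetMetadata;
>
> public class PrintRepro2Schemas {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     // Both files live in bof_repro_2 after the copy in step 2; their
>     // printed schemas should disagree, which triggers the false error.
>     for (String file : new String[] {"/tmp/bof_repro_2/0_0_0.parquet",
>                                      "/tmp/bof_repro_2/0_0_1.parquet"}) {
>       ParquetMetadata footer = ParquetFileReader.readFooter(conf, new Path(file));
>       System.out.println(file + ":\n" + footer.getFileMetaData().getSchema());
>     }
>   }
> }
> {code}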