drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Deneche A. Hakim (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-2300) problems reading DECIMAL from parquet file
Date Wed, 25 Feb 2015 00:36:04 GMT

    [ https://issues.apache.org/jira/browse/DRILL-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335728#comment-14335728
] 

Deneche A. Hakim commented on DRILL-2300:
-----------------------------------------

It's interesting to note that the simple reader will read all DECIMAL types correctly when
they are {{OPTIONAL}} and with {{PLAIN}} encoding

> problems reading DECIMAL from parquet file
> ------------------------------------------
>
>                 Key: DRILL-2300
>                 URL: https://issues.apache.org/jira/browse/DRILL-2300
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 0.7.0
>            Reporter: Deneche A. Hakim
>            Assignee: Steven Phillips
>             Fix For: Future
>
>
> There are several issues related to DECIMAL columns in parquet files, depending on which
reader is used "simple reader" or "complex reader" and if the columns are OPTIONAL or REQUIRED.
> I have the following {{test.json}} file:
> {code}
> { "a": "1" }
> { "a": "1" }
> { "a": "1" }
> {code}
> I created a parquet file using the following query:
> {noformat}
> CREATE TABLE dfs.tmp.`test_all_decimal` AS 
>   SELECT 
>     CAST(a AS DECIMAL(9,6)) decimal9_opt, 
>     CAST('1' AS DECIMAL(9,6)) decimal9_req, 
>     CAST(a AS DECIMAL(18, 8)) decimal18_opt, 
>     CAST('1' AS DECIMAL(18, 8)) decimal18_req, 
>     CAST(a AS DECIMAL(28,10)) decimal28_col, 
>     CAST('1' AS DECIMAL(28,10)) decimal28_req, 
>     CAST(a AS DECIMAL(38,10)) decimal38_col, 
>     CAST('1' AS DECIMAL(38,10)) decimal38_req 
>   FROM dfs.data.`test_char.json`;
> {noformat}
> This creates a parquet file with the following metadata, retrieved using {{parquet tools}}:
> {noformat}
> creator:       parquet-mr 
> file schema:   root 
> -----------------------------------------------------------------------------------------------------
> decimal9_opt:  OPTIONAL INT32 O:DECIMAL R:0 D:1
> decimal9_req:  REQUIRED INT32 O:DECIMAL R:0 D:0
> decimal18_opt: OPTIONAL INT64 O:DECIMAL R:0 D:1
> decimal18_req: REQUIRED INT64 O:DECIMAL R:0 D:0
> decimal28_col: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> decimal28_req: REQUIRED FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:0
> decimal38_col: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> decimal38_req: REQUIRED FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:0
> row group 1:   RC:3 TS:636 
> -----------------------------------------------------------------------------------------------------
> decimal9_opt:   INT32 SNAPPY DO:0 FPO:4 SZ:62/58/0.94 VC:3 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
> decimal9_req:   INT32 SNAPPY DO:0 FPO:66 SZ:56/52/0.93 VC:3 ENC:PLAIN_DICTIONARY,BIT_PACKED
> decimal18_opt:  INT64 SNAPPY DO:0 FPO:122 SZ:74/70/0.95 VC:3 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
> decimal18_req:  INT64 SNAPPY DO:0 FPO:196 SZ:68/64/0.94 VC:3 ENC:PLAIN_DICTIONARY,BIT_PACKED
> decimal28_col:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:264 SZ:72/91/1.26 VC:3 ENC:RLE,BIT_PACKED,PLAIN
> decimal28_req:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:336 SZ:66/85/1.29 VC:3 ENC:BIT_PACKED,PLAIN
> decimal38_col:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:402 SZ:80/111/1.39 VC:3 ENC:RLE,BIT_PACKED,PLAIN
> decimal38_req:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:482 SZ:77/105/1.36 VC:3 ENC:BIT_PACKED,PLAIN
> {noformat}
> If we disable dictionary encoding:
> {noformat}
> alter session set `store.parquet.enable_dictionary_encoding` = false;
> {noformat}
> We will get a simlar parquet file but with {{PLAIN}} encoding instead of {{PLAIN_DICTIONARY}}
for {{DECIMAL9}} and {{DECIMAL18}} columns.
> When using the "simple" parquet reader, with dictionary encoding enabled,  
> The following query returns wrong results for {{DECIMAL28/REQUIRED}} and {{DECIMAL38/REQUIRED}}
(we can't read {{DECIMAL9}} nor {{DECIMAL18}} columns because of DRILL-2262):
> {noformat}
> select decimal28_col, decimal28_req, decimal38_col, decimal38_req from dfs.tmp.`test_all_decimal`;
> +---------------+---------------+---------------+---------------+
> | decimal28_opt | decimal28_req | decimal38_opt | decimal38_req |
> +---------------+---------------+---------------+---------------+
> | 1.0000000000  | 100000000.0000000000 | 1.0000000000  | 100000000.0000000000 |
> | 1.0000000000  | 100000000.0000000000 | 1.0000000000  | 100000000.0000000000 |
> | 1.0000000000  | 100000000.0000000000 | 1.0000000000  | 100000000.0000000000 |
> +---------------+---------------+---------------+---------------+
> {noformat}
>  
> When dictionary encoding is disabled, the following query eturns wrong results for {{DECIMAL28/REQUIRED}}
and {{DECIMAL38/REQUIRED}}:
> {noformat}
> select decimal9_opt, decimal9_req, decimal18_opt, decimal18_req, decimal28_opt, decimal28_req,
decimal38_opt, decimal38_req from dfs.tmp.`test_all_decimal_nodictionary`;
> +--------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+
> | decimal9_opt | decimal9_req | decimal18_opt | decimal18_req | decimal28_opt | decimal28_req
| decimal38_opt | decimal38_req |
> +--------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+
> | 1.000000     | 1.000000     | 1.00000000    | 1.00000000    | 1.0000000000  | 100000000.0000000000
| 1.0000000000  | 100000000.0000000000 |
> | 1.000000     | 1.000000     | 1.00000000    | 1.00000000    | 1.0000000000  | 100000000.0000000000
| 1.0000000000  | 100000000.0000000000 |
> | 1.000000     | 1.000000     | 1.00000000    | 1.00000000    | 1.0000000000  | 100000000.0000000000
| 1.0000000000  | 100000000.0000000000 |
> +--------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+
> {noformat}
> When using the "complex" reader:
> {noformat}
> alter session set `store.parquet.use_new_reader` = true;
> {noformat}
> We can't read {{DECIMAL28}} nor {{DECIMAL38}} because of DRILL-2220.
> the following query gives wrong results for {DECIMAL9}} and {{DECIMAL18}} no matter if
dictionary encoding is enabled or disabled:
> {noformat}
> select decimal9_opt, decimal9_req, decimal18_opt, decimal18_req from dfs.tmp.`test_all_decimal`;
> +--------------+--------------+---------------+---------------+
> | decimal9_opt | decimal9_req | decimal18_opt | decimal18_req |
> +--------------+--------------+---------------+---------------+
> | 1000000      | 1000000      | 100000000     | 100000000     |
> | 1000000      | 1000000      | 100000000     | 100000000     |
> | 1000000      | 1000000      | 100000000     | 100000000     |
> +--------------+--------------+---------------+---------------+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message