drill-issues mailing list archives

From "Adam Gilmore (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-2286) Parquet compression causes read errors
Date Mon, 23 Feb 2015 01:37:11 GMT

    [ https://issues.apache.org/jira/browse/DRILL-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332471#comment-14332471 ]

Adam Gilmore commented on DRILL-2286:
-------------------------------------

A workaround is to disable the dictionary encoding:

alter system set `store.parquet.enable_dictionary_encoding` = false;
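
If a system-wide change is too broad, the same option can presumably be set for the current session only. The sketch below assumes that dictionary encoding is a writer-side setting, so a table written before the change would need to be created again. The table name `test_nodict` is hypothetical and used only for illustration; the source file and cast are the ones from the issue description.

```sql
-- Assumption: the option only affects Parquet files written after it is changed.
alter session set `store.parquet.enable_dictionary_encoding` = false;

-- Hypothetical table name; rewrites the data with dictionary encoding disabled.
create table dfs.tmp.test_nodict as (select cast(a as decimal(18,8)) as a from dfs.tmp.`test.json`);

-- If the assumption holds, this read should no longer hit the cast error.
select * from dfs.tmp.test_nodict;
```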

> Parquet compression causes read errors
> --------------------------------------
>
>                 Key: DRILL-2286
>                 URL: https://issues.apache.org/jira/browse/DRILL-2286
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 0.8.0
>            Reporter: Adam Gilmore
>            Assignee: Steven Phillips
>            Priority: Critical
>
> From what I can see, since compression was added to the Parquet writer, read errors can occur.
> Basically, types like timestamp and decimal are stored as int64 with some metadata. It appears that when the column is compressed, the reader tries to read int64s into a vector of timestamp/decimal types, which causes a cast error.
> Here's the JSON file I'm using:
> {code}
> { "a": 1.5 }
> { "a": 3.5 }
> { "a": 1.5 }
> { "a": 2.5 }
> { "a": 1.5 }
> { "a": 5.5 }
> { "a": 1.5 }
> { "a": 6.0 }
> { "a": 1.5 }
> {code}
> Now create a Parquet table like so:
> create table dfs.tmp.test as (select cast(a as decimal(18,8)) from dfs.tmp.`test.json`)
> Now when you try to query it like so:
> {noformat}
> 0: jdbc:drill:zk=local> select * from dfs.tmp.test;
> Query failed: RemoteRpcException: Failure while running fragment., org.apache.drill.exec.vector.NullableDecimal18Vector cannot be cast to org.apache.drill.exec.vector.NullableBigIntVector [ 91e23d42-fa06-4429-b78e-3ff32352e660 on ...:31010 ]
> [ 91e23d42-fa06-4429-b78e-3ff32352e660 on ...:31010 ]
> Error: exception while executing query: Failure while executing query. (state=,code=0)
> {noformat}
> This is the same for timestamps, for example.
> The relevant code is in ColumnReaderFactory: if the column chunk is encoded, it creates specific readers based on the physical type of the column (in this case int64, instead of timestamp/decimal).
> This is pretty severe, as it looks like compression is now enabled by default. I do note that with only 1-2 records in the JSON file, it doesn't bother compressing and the queries then work fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
