drill-dev mailing list archives

From "Adam Gilmore (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-2286) Parquet compression causes read errors
Date Mon, 23 Feb 2015 01:30:12 GMT
Adam Gilmore created DRILL-2286:
-----------------------------------

             Summary: Parquet compression causes read errors
                 Key: DRILL-2286
                 URL: https://issues.apache.org/jira/browse/DRILL-2286
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 0.8.0
            Reporter: Adam Gilmore
            Assignee: Steven Phillips
            Priority: Critical


From what I can see, since compression was added to the Parquet writer, read errors
can occur.

Basically, types like timestamp and decimal are stored as int64 with some metadata (e.g. the
decimal column below is written as int64 annotated DECIMAL(18,8)).  It appears that when the
column is compressed, the reader tries to read the raw int64s into a vector of the
timestamp/decimal type, which causes a cast error.

Here's the JSON file I'm using:

{ "a": 1.5 }
{ "a": 3.5 }
{ "a": 1.5 }
{ "a": 2.5 }
{ "a": 1.5 }
{ "a": 5.5 }
{ "a": 1.5 }
{ "a": 6.0 }
{ "a": 1.5 }

Now create a Parquet table like so:

create table dfs.tmp.test as (select cast(a as decimal(18,8)) from dfs.tmp.`test.json`)

Now when you try to query it like so:

0: jdbc:drill:zk=local> select * from dfs.tmp.test;
Query failed: RemoteRpcException: Failure while running fragment., org.apache.drill.exec.vector.NullableDecimal18Vector cannot be cast to org.apache.drill.exec.vector.NullableBigIntVector [ 91e23d42-fa06-4429-b78e-3ff32352e660 on ...:31010 ]
[ 91e23d42-fa06-4429-b78e-3ff32352e660 on ...:31010 ]

Error: exception while executing query: Failure while executing query. (state=,code=0)
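
For contrast, running the same cast directly against the JSON file should work (an untested
assumption here, but consistent with the failure being in the Parquet column reader rather
than in the cast itself):

0: jdbc:drill:zk=local> select cast(a as decimal(18,8)) from dfs.tmp.`test.json`;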

This is the same for timestamps, for example.
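
A hypothetical analogous repro for the timestamp case (assuming the TO_TIMESTAMP function,
which converts a numeric epoch value to a timestamp, is available in this build):

create table dfs.tmp.test_ts as (select to_timestamp(a) as ts from dfs.tmp.`test.json`);
select * from dfs.tmp.test_ts;

This should fail the same way, but with NullableTimeStampVector in place of
NullableDecimal18Vector.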

The relevant code is in ColumnReaderFactory: when the column chunk is encoded, it creates
specific readers based on the primitive type of the column (in this case int64) rather than
the logical timestamp/decimal type.

This is pretty severe, as compression now appears to be enabled by default.  I do note
that with only 1-2 records in the JSON file, the writer doesn't bother compressing, and those
queries then work fine.
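
If compression is indeed the trigger, a possible workaround (assuming the
store.parquet.compression session option is present in this build) is to disable it before
the CTAS:

alter session set `store.parquet.compression` = 'none';
create table dfs.tmp.test2 as (select cast(a as decimal(18,8)) from dfs.tmp.`test.json`);

With compression off, selecting from the new table should then read fine, matching the
behaviour of files too small to be compressed.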



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
