drill-issues mailing list archives

From "Jason Altekruse (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4349) parquet reader returns wrong results when reading a nullable column that starts with a large number of nulls (>30k)
Date Wed, 10 Feb 2016 02:51:18 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140236#comment-15140236
] 

Jason Altekruse commented on DRILL-4349:
----------------------------------------

As I was rolling the rc3 release candidate for 1.5.0, I decided to apply this fix to the release
branch, as it seemed useful to include in the release. The commit hash will be different, but
the patch applied cleanly and the resulting diff is identical.

> parquet reader returns wrong results when reading a nullable column that starts with
a large number of nulls (>30k)
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-4349
>                 URL: https://issues.apache.org/jira/browse/DRILL-4349
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.4.0
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>            Priority: Critical
>             Fix For: 1.5.0
>
>         Attachments: drill4349.tar.gz
>
>
> While reading a nullable column, if a single pass reads only null values, the parquet
reader resets the value of pageReader.readPosInBytes, which causes wrong data to be read from
the file.
> To reproduce the issue, create a CSV file (repro.csv) with 2 columns (id, val) and 50100
rows, where id equals the row number and val is empty for the first 50k rows and equal
to id for the remaining rows.
> Create a parquet table from the CSV file:
> {noformat}
> CREATE TABLE `repro_parquet` AS SELECT CAST(columns[0] AS INT) AS id, CAST(NULLIF(columns[1], '') AS DOUBLE) AS val FROM `repro.csv`;
> {noformat}
> Now, if you query any of the non-null values, you will get wrong results:
> {noformat}
> 0: jdbc:drill:zk=local> select * from `repro_parquet` where id>=50000 limit 10;
> +--------+---------------------------+
> |   id   |            val            |
> +--------+---------------------------+
> | 50000  | 9.11337776337441E-309     |
> | 50001  | 3.26044E-319              |
> | 50002  | 1.4916681476489723E-154   |
> | 50003  | 2.0000000018890676        |
> | 50004  | 2.681561588521345E154     |
> | 50005  | -2.1016574E-317           |
> | 50006  | -1.4916681476489723E-154  |
> | 50007  | -2.0000000018890676       |
> | 50008  | -2.681561588521345E154    |
> | 50009  | 2.1016574E-317            |
> +--------+---------------------------+
> 10 rows selected (0.238 seconds)
> {noformat}
> and here are the expected values:
> {noformat}
> 0: jdbc:drill:zk=local> select * from `repro.csv` where cast(columns[0] as int)>=50000 limit 10;
> +--------------------+
> |      columns       |
> +--------------------+
> | ["50000","50000"]  |
> | ["50001","50001"]  |
> | ["50002","50002"]  |
> | ["50003","50003"]  |
> | ["50004","50004"]  |
> | ["50005","50005"]  |
> | ["50006","50006"]  |
> | ["50007","50007"]  |
> | ["50008","50008"]  |
> | ["50009","50009"]  |
> +--------------------+
> {noformat}
> I confirmed that the file is written correctly and that the issue is in the parquet reader
(I already have a fix for it).
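
The repro.csv file described in the reproduction steps can be generated with a short script. The following is a minimal sketch in Python and is not part of the original report: the function name write_repro_csv is hypothetical, and it assumes that ids start at 0, which is inferred from the expected output where id 50000 is the first row that carries a value.

```python
import csv


def write_repro_csv(out, total_rows=50100, null_rows=50000):
    """Write rows of (id, val) to the text stream `out`.

    Assumption: ids are 0-based, so ids 0..49999 have an empty val
    and ids 50000..50099 have val equal to id.
    """
    writer = csv.writer(out, lineterminator="\n")
    for row_id in range(total_rows):
        # Leave val empty for the leading rows so it becomes NULL
        # after NULLIF(columns[1], '') in the CREATE TABLE statement.
        val = "" if row_id < null_rows else str(row_id)
        writer.writerow([row_id, val])


if __name__ == "__main__":
    # No header row is written, because the queries address fields
    # positionally as columns[0] and columns[1].
    with open("repro.csv", "w", newline="") as f:
        write_repro_csv(f)
```

Running the script writes repro.csv to the current directory; the file then needs to be placed where the Drill workspace used for the queries can read it.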



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
