drill-issues mailing list archives

From "Jinfeng Ni (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-3533) null values in a sub-structure in Parquet returns unexpected/misleading results
Date Wed, 22 Jul 2015 00:40:04 GMT

    [ https://issues.apache.org/jira/browse/DRILL-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636073#comment-14636073
] 

Jinfeng Ni commented on DRILL-3533:
-----------------------------------

The plan for the query against the parquet file seems to make sense. I debugged a little, and
it seems that the issue comes from the parquet scan reader.  Essentially, the parquet scan reader
returns a schema which contains a map vector with a field of "budgetLevel", instead of "adults".
This causes the downstream operator to produce the wrong result.
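
The reader code itself is not shown in this message, so the following is only a toy model of the symptom described above, not Drill's implementation. It assumes, purely for illustration, that the value stored under "adults" ends up labelled "budgetLevel" because names are assigned by position. The function names `read_map_buggy`, `read_map_fixed`, and `project` are hypothetical.

```python
# Toy model only: this is NOT Drill's parquet reader. All names are hypothetical.

def read_map_buggy(requested, stored):
    """Assumed failure mode: values found in the file are labelled with the
    requested names by position, so a missing first field shifts the labels."""
    found = [stored[name] for name in requested if name in stored]
    out = {name: None for name in requested}
    for name, value in zip(requested, found):
        out[name] = value
    return out


def read_map_fixed(requested, stored):
    """Each value keeps the name it is stored under; missing fields are null."""
    return {name: stored.get(name) for name in requested}


def lower_or_null(value):
    return value.lower() if value is not None else None


def project(row):
    # Mirrors: lower(p.dimensions.budgetLevel) as field1,
    #          lower(p.dimensions.adults)      as field2
    return (lower_or_null(row["budgetLevel"]), lower_or_null(row["adults"]))


stored = {"adults": "A"}                # the single record in the test data
requested = ["budgetLevel", "adults"]   # columns referenced by the query

buggy_result = project(read_map_buggy(requested, stored))
fixed_result = project(read_map_fixed(requested, stored))
print(buggy_result)  # ('a', None)
print(fixed_result)  # (None, 'a')
```

In this model the mislabelled map reproduces the `a | null` row from the original report, while labelling values by their stored name gives `null | a`, matching the JSON result.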

I made a small change to the parquet scan reader, and it seems to return the correct result with
the fix.  I'll run the whole regression suite to see if the fix causes any regressions.

{code}

select p.dimensions.budgetLevel as `field1`, lower(p.dimensions.adults) as `field2` from dfs.tmp.`/test/0_0_0.parquet` as p;
+---------+---------+
| field1  | field2  |
+---------+---------+
| null    | a       |
+---------+---------+
1 row selected (0.261 seconds)
0: jdbc:drill:zk=local> select p.dimensions.budgetLevel as `field1`, lower(p.dimensions.adults) as `field2` from dfs.tmp.`test.json` as p;
+---------+---------+
| field1  | field2  |
+---------+---------+
| null    | a       |
+---------+---------+
1 row selected (0.235 seconds)
{code}


> null values in a sub-structure in Parquet returns unexpected/misleading results
> -------------------------------------------------------------------------------
>
>                 Key: DRILL-3533
>                 URL: https://issues.apache.org/jira/browse/DRILL-3533
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.1.0
>            Reporter: Stefán Baxter
>            Assignee: Jinfeng Ni
>            Priority: Critical
>
> With this minimal dataset as /tmp/test.json:
> {"dimensions":{"adults":"A"}}
> select lower(p.dimensions.budgetLevel) as `field1`, lower(p.dimensions.adults) as `field2` from dfs.tmp.`/test.json` as p;
> Returns this:
> +---------+---------+
> | field1  | field2  |
> +---------+---------+
> | null    | a       |
> +---------+---------+
> With the same data as a Parquet file
> CREATE TABLE dfs.tmp.`/test` AS SELECT * FROM dfs.tmp.`/test.json`;
> The same query:
> select lower(p.dimensions.budgetLevel) as `field1`, lower(p.dimensions.adults) as `field2` from dfs.tmp.`/test/0_0_0.parquet` as p;
> Returns this:
> +---------+---------+
> | field1  | field2  |
> +---------+---------+
> | a       | null    |
> +---------+---------+
> After some more testing it appears that this has nothing to do with trim. (Any non-existent nested value will be pushed aside.)
> select p.dimensions.budgetLevel as `field1`, lower(p.dimensions.adults) as `field2` from dfs.tmp.`/test/0_0_0.parquet` as p;
> also returns:
> +---------+---------+
> | field1  | field2  |
> +---------+---------+
> | a       | null    |
> +---------+---------+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
