drill-issues mailing list archives

From "Rahul Challapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5183) Drill doesn't seem to handle array values correctly in Parquet files
Date Fri, 30 Jun 2017 18:38:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070547#comment-16070547 ]

Rahul Challapalli commented on DRILL-5183:
------------------------------------------

Hmm, this is clearly a bug we should fix. In the meantime, have you considered the workaround below, which uses a view to select the nested `array` field directly?
{code}
0: jdbc:drill:zk=10.10.100.190:5181> create view drill5183 as select d.title, d.pages, d.authors.`array` authors from dfs.`/drill/testdata/books.parquet` d;
+-------+---------------------------------------------------------------------+
|  ok   |                               summary                               |
+-------+---------------------------------------------------------------------+
| true  | View 'drill5183' created successfully in 'dfs.drillTestDir' schema  |
+-------+---------------------------------------------------------------------+
1 row selected (0.403 seconds)
0: jdbc:drill:zk=10.10.100.190:5181> select * from drill5183;
+---------------------------------------+--------+------------------------------------------------+
|                 title                 | pages  |                    authors                     |
+---------------------------------------+--------+------------------------------------------------+
| Physics of Waves                      | 477    | ["William C. Elmore","Mark A. Heald"]          |
| Foundations of Mathematical Analysis  | 428    | ["Richard Johnsonbaugh","W.E. Pfaffenberger"]  |
+---------------------------------------+--------+------------------------------------------------+
2 rows selected (0.33 seconds)
{code}
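
Once the view exists, the authors column should behave like any other repeated column. Here is a sketch of two follow-up queries. I have not run them against the attached file. FLATTEN and REPEATED_COUNT are standard Drill functions; the aliases author and author_count are only illustrative.
{code}
-- Untested sketch; assumes the drill5183 view created above.
-- One output row per author: FLATTEN expands a repeated column into rows.
select title, flatten(authors) as author from drill5183;

-- Number of authors per book.
select title, repeated_count(authors) as author_count from drill5183;
{code}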

> Drill doesn't seem to handle array values correctly in Parquet files
> --------------------------------------------------------------------
>
>                 Key: DRILL-5183
>                 URL: https://issues.apache.org/jira/browse/DRILL-5183
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Dave Kincaid
>         Attachments: books.parquet
>
>
> It looks to me as though Drill is not properly converting array values in Parquet records. I have created a simple example and will attach a small Parquet file to this issue. The records are written using this Avro schema:
> {code:title=Book.avsc}
> { "type": "record",
>   "name": "Book",
>   "fields": [
>     { "name": "title", "type": "string" },
>     { "name": "pages", "type": "int" },
>     { "name": "authors", "type": {"type": "array", "items": "string"} }
>   ]
> }
> {code}
> When I write two records using this schema into the attached Parquet file and then simply run {{SELECT * FROM dfs.`books.parquet`}}, I get the following result:
> ||title||pages||authors||
> |Physics of Waves|477|{"array":["William C. Elmore","Mark A. Heald"]}|
> |Foundations of Mathematical Analysis|428|{"array":["Richard Johnsonbaugh","W.E. Pfaffenberger"]}|
> You can see that the authors column appears to be a nested record containing a field named "array" rather than a repeated value. If I change the SQL query to {{SELECT title,pages,t.authors.`array` FROM dfs.`/home/davek/src/drill-parquet-example/resources/books.parquet` t;}} then I get:
> ||title||pages||EXPR$2||
> |Physics of Waves|477|["William C. Elmore","Mark A. Heald"]|
> |Foundations of Mathematical Analysis|428|["Richard Johnsonbaugh","W.E. Pfaffenberger"]|
> and that column now behaves in Drill as a repeated-value column.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
