hive-issues mailing list archives

From "Ferdinand Xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-15055) Column pruning for nested fields in Parquet
Date Fri, 23 Dec 2016 01:55:58 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15771609#comment-15771609
] 

Ferdinand Xu commented on HIVE-15055:
-------------------------------------

Thanks [~csun] for the design document and benchmark information. Could you add a little more
detail on the query and table structure you used in the benchmark, for a better understanding?

> Column pruning for nested fields in Parquet
> -------------------------------------------
>
>                 Key: HIVE-15055
>                 URL: https://issues.apache.org/jira/browse/HIVE-15055
>             Project: Hive
>          Issue Type: New Feature
>          Components: Logical Optimizer, Physical Optimizer, Serializers/Deserializers
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>              Labels: performance
>         Attachments: benchmark-hos.pdf, design-doc-nested-column-pruning.pdf
>
>
> Some columnar file formats, such as Parquet, also store the fields of a struct type column by
column, using the encoding described in the Google Dremel paper. It is very common in big data
for data to be stored in structs while queries need only a subset of the fields in those structs.
However, Hive currently still needs to read the whole struct regardless of whether all fields
are selected. Therefore, pruning unwanted sub-fields in structs or nested fields at file-reading
time would be a big performance boost in such scenarios.
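
To make the idea in the description concrete, here is a minimal sketch of pruning a nested
schema down to the fields a query selects. It is not Hive's or Parquet's implementation: the
dict-based schema, the dotted-path notation, and the names build_selection and prune_schema
are all assumptions made for illustration only.

```python
def build_selection(paths):
    """Turn dotted paths into a tree; None marks a fully selected subtree."""
    root = {}
    for path in paths:
        parts = path.split(".")
        node = root
        for i, name in enumerate(parts):
            if name in node and node[name] is None:
                break  # an ancestor is already selected as a whole
            if i == len(parts) - 1:
                node[name] = None  # select this field and everything below it
            else:
                node = node.setdefault(name, {})
    return root


def _prune(schema, selection):
    pruned = {}
    for name, sub in selection.items():
        if name not in schema:
            raise KeyError(name)
        field = schema[name]
        if sub is None:
            pruned[name] = field  # keep the whole field (struct or primitive)
        elif isinstance(field, dict):
            pruned[name] = _prune(field, sub)  # descend into the struct
        else:
            raise KeyError(name)  # path descends into a non-struct field
    return pruned


def prune_schema(schema, paths):
    """Keep only the (possibly nested) fields named by the dotted paths."""
    return _prune(schema, build_selection(paths))


# Hypothetical table schema: structs are dicts, leaves are type names.
schema = {
    "id": "bigint",
    "user": {
        "name": "string",
        "address": {"city": "string", "zip": "string"},
    },
}

# A query touching only id and user.address.city needs just these leaves.
needed = prune_schema(schema, ["id", "user.address.city"])
print(needed)
```

With a pruned schema like this, a columnar reader would only have to open the column chunks
for the remaining leaf fields instead of materializing the whole struct.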



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
