hive-issues mailing list archives

From "Dong Chen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-8128) Improve Parquet Vectorization
Date Tue, 21 Jul 2015 08:50:04 GMT

    [ https://issues.apache.org/jira/browse/HIVE-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634790#comment-14634790 ]

Dong Chen commented on HIVE-8128:
---------------------------------

Patch V6 updated. Review board: https://reviews.apache.org/r/36540/

The patch depends on the new Parquet vector API at https://github.com/nezihyigitbasi-nflx/parquet-mr/commits/vector

In this POC, the general workflow is in place, two tests pass, and the INT type is supported.
The idea is to create a VectorizedParquetRecordReader that wraps the ParquetRecordReader
provided by Parquet. Its next() method then converts each Parquet RowBatch into a Hive VectorizedRowBatch.
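
To illustrate the wrapping idea, here is a minimal sketch rather than the code in the patch. Every type in it (SketchParquetBatch, SketchHiveBatch, SketchParquetReader, SketchVectorizedReader) is a hypothetical stand-in invented for this example; none of them is the Parquet vector API or a Hive class. It assumes a single INT column and widens each value to long, on the assumption that the Hive side stores integer columns in a long array.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for a Parquet row batch holding one INT column.
final class SketchParquetBatch {
    final int[] values;
    SketchParquetBatch(int... values) { this.values = values; }
}

// Hypothetical stand-in for a Hive vectorized batch with one long-backed column.
final class SketchHiveBatch {
    final long[] vector;
    int size; // number of valid rows currently in the batch
    SketchHiveBatch(int capacity) { this.vector = new long[capacity]; }
}

// Hypothetical stand-in for the reader provided by Parquet.
interface SketchParquetReader {
    SketchParquetBatch nextBatch(); // returns null when the input is exhausted
}

// The wrapper: next() pulls one Parquet batch and copies it into the Hive batch.
final class SketchVectorizedReader {
    private final SketchParquetReader inner;

    SketchVectorizedReader(SketchParquetReader inner) { this.inner = inner; }

    boolean next(SketchHiveBatch out) {
        SketchParquetBatch in = inner.nextBatch();
        if (in == null) {
            out.size = 0;
            return false;
        }
        if (in.values.length > out.vector.length) {
            throw new IllegalArgumentException("Parquet batch larger than Hive batch capacity");
        }
        for (int i = 0; i < in.values.length; i++) {
            out.vector[i] = in.values[i]; // widen int to long
        }
        out.size = in.values.length;
        return true;
    }
}

final class ReaderSketchDemo {
    public static void main(String[] args) {
        Iterator<SketchParquetBatch> source = Arrays.asList(
                new SketchParquetBatch(1, 2, 3),
                new SketchParquetBatch(4)).iterator();
        SketchVectorizedReader reader =
                new SketchVectorizedReader(() -> source.hasNext() ? source.next() : null);
        SketchHiveBatch batch = new SketchHiveBatch(4);
        while (reader.next(batch)) {
            System.out.println(Arrays.toString(Arrays.copyOf(batch.vector, batch.size)));
        }
    }
}
```

The sketch reuses one output batch across calls, so the caller must only read the first size entries after each next().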

This is the first patch. To complete the vectorization feature, follow-up work remains:
1) support all data types
2) support partition columns
3) add more test cases
4) evaluate performance on a real cluster.

> Improve Parquet Vectorization
> -----------------------------
>
>                 Key: HIVE-8128
>                 URL: https://issues.apache.org/jira/browse/HIVE-8128
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Brock Noland
>            Assignee: Dong Chen
>             Fix For: parquet-branch
>
>         Attachments: HIVE-8128-parquet.patch.POC, HIVE-8128.1-parquet.patch
>
>
> NO PRECOMMIT TESTS
> What we'll want to do is finish the vectorization work (e.g. VectorizedOrcSerde, VectorizedOrcSerde) which was partially done in HIVE-5998.
> As discussed in PARQUET-131, we will work out a Hive POC based on the new Parquet vectorized API, and then finish the implementation once it is finalized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
