hive-issues mailing list archives

From "Dong Chen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-8128) Improve Parquet Vectorization
Date Mon, 13 Jul 2015 08:47:04 GMT

    [ https://issues.apache.org/jira/browse/HIVE-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624378#comment-14624378
] 

Dong Chen commented on HIVE-8128:
---------------------------------

Hi [~nezihyigitbasi], I updated and ran the Hive POC against the latest changes in your repo:
https://github.com/nezihyigitbasi-nflx/parquet-mr/commits/vector
Everything looks good. Thanks.

During development, I had some thoughts about the vector API. Could you take a look at
them?

* In {{ColumnVector}}, how about adding two attributes? One is {{boolean noNulls}}, which
indicates that the whole column vector contains no null values. The other is {{boolean isRepeating}},
which indicates that the same value repeats for the whole column vector. Both could be calculated
while we read a vector.
The reason we want them is that the Hive vector engine can check these attributes to skip some
values. It would probably be better to calculate them once in Parquet than to calculate them
by revisiting the vectors in Hive. (I am not sure whether other engines need this, but it should
be fine for Parquet to support it.)
* In {{RowBatch}}, how about adding one attribute, {{int size}}, which indicates the number
of rows in the batch? This is just for convenience. Its value should be the same as {{RowBatch.columns\[0\].numValues}}.
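
The two proposals above could be sketched roughly as follows. This is only an illustration under
stated assumptions, not the actual Parquet vector API: the class {{LongColumnVector}}, the
{{values}} and {{isNull}} arrays, the {{fill}} method and the {{VectorSum}} consumer are all
hypothetical names. Only {{noNulls}}, {{isRepeating}}, {{size}}, {{columns}} and {{numValues}} come
from the discussion.

```java
// Hypothetical sketch: names other than noNulls, isRepeating, size,
// columns and numValues are assumptions made for illustration.
final class LongColumnVector {
    final long[] values;
    final boolean[] isNull;
    int numValues;
    boolean noNulls = true;      // true if no entry in the vector is null
    boolean isRepeating = false; // true if every entry equals entry 0

    LongColumnVector(int capacity) {
        values = new long[capacity];
        isNull = new boolean[capacity];
    }

    // Fills the vector (a null element stands for SQL NULL) and
    // computes both flags in the same pass over the data.
    void fill(Long[] source) {
        numValues = Math.min(source.length, values.length);
        noNulls = true;
        isRepeating = numValues > 0;
        for (int i = 0; i < numValues; i++) {
            isNull[i] = source[i] == null;
            values[i] = isNull[i] ? 0L : source[i];
            if (isNull[i]) {
                noNulls = false;
            }
            if (i > 0 && (isNull[i] != isNull[0] || values[i] != values[0])) {
                isRepeating = false;
            }
        }
    }
}

final class RowBatch {
    final LongColumnVector[] columns;
    final int size; // number of rows; same as columns[0].numValues

    RowBatch(LongColumnVector... columns) {
        this.columns = columns;
        this.size = columns.length == 0 ? 0 : columns[0].numValues;
    }
}

// Example consumer: a null-skipping sum that uses the flags as fast paths.
final class VectorSum {
    static long sum(RowBatch batch, int col) {
        LongColumnVector v = batch.columns[col];
        if (batch.size == 0) {
            return 0L;
        }
        if (v.isRepeating) {
            // One value stands for the whole batch; no need to scan.
            return v.isNull[0] ? 0L : v.values[0] * batch.size;
        }
        long total = 0L;
        if (v.noNulls) {
            for (int i = 0; i < batch.size; i++) {
                total += v.values[i]; // no per-row null check needed
            }
        } else {
            for (int i = 0; i < batch.size; i++) {
                if (!v.isNull[i]) {
                    total += v.values[i];
                }
            }
        }
        return total;
    }
}
```

The point of the sketch is that the reader computes the flags once while it fills the vector, so a
consumer can pick a fast path without scanning the vector a second time.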

What do you think?

> Improve Parquet Vectorization
> -----------------------------
>
>                 Key: HIVE-8128
>                 URL: https://issues.apache.org/jira/browse/HIVE-8128
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Brock Noland
>            Assignee: Dong Chen
>             Fix For: parquet-branch
>
>         Attachments: HIVE-8128-parquet.patch.POC, HIVE-8128.1-parquet.patch
>
>
> NO PRECOMMIT TESTS
> We'll want to finish the vectorization work (e.g. VectorizedOrcSerde, VectorizedOrcSerde)
which was partially done in HIVE-5998.
> As discussed in PARQUET-131, we will work out a Hive POC based on the new Parquet vectorized
API, and then finish the implementation after it is finalized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
