hive-dev mailing list archives

From "Dong Chen (JIRA)" <>
Subject [jira] [Commented] (HIVE-8128) Improve Parquet Vectorization
Date Mon, 24 Nov 2014 08:19:13 GMT


Dong Chen commented on HIVE-8128:

To improve Parquet vectorization, I think we need the following changes, and they should be based
on PARQUET-131. These are initial thoughts; I will make them more specific after working
on the Parquet side for a while.

The following assumes that the RecordReader in Hive will get data of type {{ParquetVectorizedRowBatch}}.

1. The next() method of {{VectorizedParquetRecordReader}} should be {{next(NullWritable key,
ParquetVectorizedRowBatch outputBatch)}}. This lets Hive read one vectorized batch of Parquet
rows per call.

2. A {{VectorizedParquetHiveSerDe}} will be added to convert {{ParquetVectorizedRowBatch}}
to the {{VectorizedRowBatch}} that Hive recognizes. To make the conversion efficient, the Parquet
vectorized API design might take this into account. The more similar the two kinds of row batch
are, the better.

3. Partition support is already in trunk. Whether it works for Parquet should be verified
after the main work is done, with changes made if necessary.
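
To make items 1 and 2 concrete, here is a minimal sketch, not an actual implementation. Every
type in it is a hypothetical stand-in: {{ParquetVectorizedRowBatch}} does not exist yet (it depends
on PARQUET-131), {{NullKey}} stands in for Hadoop's {{NullWritable}}, and {{HiveRowBatchStandIn}}
is a one-column simplification of Hive's {{VectorizedRowBatch}}. It only assumes that both batch
types store a column as a primitive array plus a row count.

```java
// Hypothetical stand-in for the Parquet-side batch proposed under PARQUET-131.
final class ParquetVectorizedRowBatch {
    final long[] values; // a single long column, for illustration only
    int size;            // number of valid rows currently held in values

    ParquetVectorizedRowBatch(int capacity) { values = new long[capacity]; }
}

// Hypothetical one-column simplification of Hive's VectorizedRowBatch.
final class HiveRowBatchStandIn {
    final long[] vector;
    int size;

    HiveRowBatchStandIn(int capacity) { vector = new long[capacity]; }
}

// Stand-in for org.apache.hadoop.io.NullWritable, so the sketch needs no Hadoop jars.
final class NullKey {
    static final NullKey INSTANCE = new NullKey();
    private NullKey() {}
}

// Item 1: next() fills a whole batch per call instead of returning a single row.
final class VectorizedParquetRecordReaderSketch {
    private final long[] source; // pretend this is a decoded Parquet column
    private int position = 0;

    VectorizedParquetRecordReaderSketch(long[] source) { this.source = source; }

    boolean next(NullKey key, ParquetVectorizedRowBatch outputBatch) {
        int n = Math.min(outputBatch.values.length, source.length - position);
        if (n <= 0) {
            outputBatch.size = 0;
            return false; // no more rows
        }
        System.arraycopy(source, position, outputBatch.values, 0, n);
        outputBatch.size = n;
        position += n;
        return true;
    }
}

// Item 2: when the two layouts are similar, conversion is one bulk copy per column.
final class VectorizedParquetHiveSerDeSketch {
    // Assumes out.vector is at least as large as in.size.
    static void convert(ParquetVectorizedRowBatch in, HiveRowBatchStandIn out) {
        System.arraycopy(in.values, 0, out.vector, 0, in.size);
        out.size = in.size;
    }
}

final class VectorizedReadDemo {
    // Reads all batches, converts each one, and sums the values on the Hive side.
    static long sumAll(long[] data, int batchSize) {
        VectorizedParquetRecordReaderSketch reader = new VectorizedParquetRecordReaderSketch(data);
        ParquetVectorizedRowBatch parquetBatch = new ParquetVectorizedRowBatch(batchSize);
        HiveRowBatchStandIn hiveBatch = new HiveRowBatchStandIn(batchSize);
        long total = 0;
        while (reader.next(NullKey.INSTANCE, parquetBatch)) {
            VectorizedParquetHiveSerDeSketch.convert(parquetBatch, hiveBatch);
            for (int i = 0; i < hiveBatch.size; i++) {
                total += hiveBatch.vector[i];
            }
        }
        return total;
    }
}
```

If the Parquet batch diverged from the Hive layout (for example, a different null representation),
{{convert}} would need a per-row loop instead of a bulk copy, which is the cost item 2 tries to avoid.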

> Improve Parquet Vectorization
> -----------------------------
>                 Key: HIVE-8128
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Brock Noland
>            Assignee: Dong Chen
> What we'll want to do is finish the vectorization work (e.g. VectorizedOrcSerde, VectorizedOrcSerde)
> which was partially done in HIVE-5998.

This message was sent by Atlassian JIRA