hive-dev mailing list archives

From "Remus Rusanu (JIRA)" <>
Subject [jira] [Updated] (HIVE-5998) Add vectorized reader for Parquet files
Date Mon, 10 Feb 2014 14:20:23 GMT


Remus Rusanu updated HIVE-5998:

    Status: Patch Available  (was: Open)

This fix provides vectorized execution on top of the normal ParquetInputFormat. No changes
are required to the table declaration.
This implementation does not cross the border between Hive and Parquet, and as such it uses
the existing Hive Parquet record reader, which is row mode. The vectorized output is 'shallow':
it is built on top of the row-mode reader by iterating over its rows. This is not optimal for
vectorized execution, but nonetheless this first step brings the benefits of the vectorized
operators to the Parquet format. Going forward, a deep vectorized reader would be required, but
such an endeavour requires changes on the Parquet side of the border (the parquet-mr project).
Bringing Hive dependencies like VectorizationContext and VectorizedRowBatch into parquet-mr is
not feasible imho now (there are bandwidth/capacity issues from me/Eric/Jitendra, but also
engineering issues, like circular dependencies). A deep vectorized reader inside parquet-mr
would have to be based on a design that considers other possible vectorized engine consumers
(hint: Pig).
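
To illustrate what 'shallow' means here, below is a minimal sketch, not the code in the attached patch. It assumes a row-mode source exposed as a plain Iterator and copies rows into a column-major batch. LongBatch and ShallowBatchReader are hypothetical names invented for this example; they stand in for Hive's VectorizedRowBatch and the existing row-mode Parquet record reader and do not use their real APIs.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a columnar batch; NOT Hive's VectorizedRowBatch.
final class LongBatch {
    final long[][] cols;   // cols[c][r]: column-major storage
    int size;              // number of valid rows currently in the batch

    LongBatch(int numCols, int capacity) {
        cols = new long[numCols][capacity];
    }

    int capacity() {
        return cols.length == 0 ? 0 : cols[0].length;
    }
}

// Hypothetical shallow reader: wraps a row-mode source and fills batches by iterating.
final class ShallowBatchReader {
    // Assumption: the row-mode record reader is modelled as an iterator of long[] rows.
    private final Iterator<long[]> rows;

    ShallowBatchReader(Iterator<long[]> rows) {
        this.rows = rows;
    }

    // Copies up to capacity() rows into the batch, one row at a time.
    // Returns false once the underlying row source is exhausted.
    boolean next(LongBatch batch) {
        batch.size = 0;
        int cap = batch.capacity();
        while (batch.size < cap && rows.hasNext()) {
            long[] row = rows.next();
            for (int c = 0; c < batch.cols.length; c++) {
                batch.cols[c][batch.size] = row[c];   // pivot row value into its column
            }
            batch.size++;
        }
        return batch.size > 0;
    }
}

final class ShallowReaderDemo {
    public static void main(String[] args) {
        List<long[]> rows = Arrays.asList(
            new long[] {1, 10}, new long[] {2, 20}, new long[] {3, 30});
        ShallowBatchReader reader = new ShallowBatchReader(rows.iterator());
        LongBatch batch = new LongBatch(2, 2);
        while (reader.next(batch)) {
            System.out.println("batch of " + batch.size + " rows");
        }
    }
}
```

The per-row copy in next() is the cost that makes this approach suboptimal: every value is still decoded into a row first and only then pivoted into a column. A deep reader would fill the column arrays directly from the Parquet column chunks.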

> Add vectorized reader for Parquet files
> ---------------------------------------
>                 Key: HIVE-5998
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Remus Rusanu
>            Assignee: Remus Rusanu
>            Priority: Minor
>         Attachments: HIVE-5998.1.patch
> HIVE-5783 is adding native Parquet support in Hive. As Parquet is a columnar format,
> it makes sense to provide a vectorized reader, as the RC and ORC formats have, to
> benefit from the vectorized execution engine.

This message was sent by Atlassian JIRA
