orc-dev mailing list archives

From Aliaksei Sandryhaila <asand...@apache.org>
Subject Re: Proposed metadata for ORC files
Date Fri, 25 Sep 2015 17:47:37 GMT
Hi Owen,

This will be a very useful statistic for resource reservation.

A couple of obvious suggestions (to make sure they sound reasonable):
- make this statistic optional or re. Only list and array data types 
really need it;
- store the statistic in each stripe footer (stripe-level max instances 
per 1024 rows) and file footer (file-level max instances per 1024 rows).
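
To make the second suggestion concrete, here is a rough sketch in Python of how a writer might derive both values for one list column. It is only an illustration under my own assumptions, not the ORC writer's code: the function names are hypothetical, row groups are assumed to start at the beginning of each stripe, and the input is assumed to be the number of child elements in each row.

```python
from typing import List

ROW_GROUP_SIZE = 1024  # rows per vector row group, as in the proposal


def stripe_max_instances(list_lengths: List[int],
                         row_group_size: int = ROW_GROUP_SIZE) -> int:
    """Hypothetical helper: largest total child-element count in any
    row group of one stripe. list_lengths[i] is the number of child
    elements in row i of the stripe (assumed input format)."""
    best = 0
    for start in range(0, len(list_lengths), row_group_size):
        group_total = sum(list_lengths[start:start + row_group_size])
        best = max(best, group_total)
    return best


def file_max_instances(stripe_maxima: List[int]) -> int:
    """Hypothetical helper: the file-level value is the maximum of the
    stripe-level values (0 for a file with no stripes)."""
    return max(stripe_maxima, default=0)


# Two made-up stripes: per-row list lengths for a single list column.
stripes = [
    [2] * 1024 + [5] * 10,  # row groups total 2048 and 50
    [1] * 2048,             # row groups total 1024 and 1024
]
stripe_maxima = [stripe_max_instances(s) for s in stripes]
file_maximum = file_max_instances(stripe_maxima)
```

With these made-up inputs the stripe-level values are 2048 and 1024, and the file-level value is 2048. A reader could size the child vector from the stripe value when it starts a stripe, or from the file value if it wants a single allocation up front.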

Since ORC files are written primarily with Hive now, how soon can this 
statistic be added to Hive's ORC writer?

Thank you,
Aliaksei.


On 09/24/2015 04:40 PM, Owen O'Malley wrote:
> All,
>     While thinking about making resource management for vectorized ORC
> readers, one of the difficult points is figuring out how big the vectors
> for the nested types need to be.  I'd like to propose that we add a
> statistic for each column that records the maximum number of instances we
> need for each vector row group of 1024 rows.
>
>    Having that number would let you set the vector row batch for the complex
> types as you are starting each stripe as well as being able to predict how
> much memory the reader will need.
>
> Thoughts?
>     Owen
>

