orc-dev mailing list archives

From Aliaksei Sandryhaila <asand...@apache.org>
Subject Re: Proposed metadata for ORC files
Date Fri, 25 Sep 2015 17:47:37 GMT
Hi Owen,

This will be a very useful statistic for resource reservation.

A couple of obvious suggestions (to make sure they sound reasonable):
- make this statistic optional or re. Only list and array data types 
really need it;
- store the statistic in each stripe footer (stripe-level max instances 
per 1024 rows) and file footer (file-level max instances per 1024 rows).
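To make the stripe-level and file-level numbers concrete, here is a rough Python sketch of how they could be derived for a single list column. It is only an illustration: the function names are hypothetical, and it assumes the 1024-row groups are aligned to the start of each stripe, which the proposal does not specify.

```python
from typing import List, Sequence, Tuple

ROW_GROUP_SIZE = 1024  # rows per vector row group, as in the proposal


def max_instances_per_group(child_counts: Sequence[int],
                            group_size: int = ROW_GROUP_SIZE) -> int:
    """Largest total number of child instances in any group of rows.

    child_counts[i] is the number of child elements in row i of one stripe.
    Assumption: groups are aligned to the start of the stripe.
    """
    best = 0
    for start in range(0, len(child_counts), group_size):
        best = max(best, sum(child_counts[start:start + group_size]))
    return best


def stripe_and_file_maxima(stripes: Sequence[Sequence[int]],
                           group_size: int = ROW_GROUP_SIZE
                           ) -> Tuple[List[int], int]:
    """Return (per-stripe maxima, file-level maximum) for one column."""
    stripe_maxima = [max_instances_per_group(s, group_size) for s in stripes]
    # The file-level value is the maximum over all stripe-level values.
    return stripe_maxima, max(stripe_maxima, default=0)


# Two hypothetical stripes: the per-row list lengths are made up.
example_stripes = [[2] * 1024 + [5] * 10, [1] * 2048]
stripe_maxima, file_maximum = stripe_and_file_maxima(example_stripes)
```

With these made-up inputs a reader could size the child vector for each stripe from `stripe_maxima`, or use `file_maximum` as a single upper bound for the whole file.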

Since ORC files are written primarily with Hive now, how soon can this 
statistic be added to Hive's ORC writer?

Thank you,

On 09/24/2015 04:40 PM, Owen O'Malley wrote:
> All,
>     While thinking about making resource management for vectorized ORC
> readers, one of the difficult points is figuring out how big the vectors
> for the nested types need to be.  I'd like to propose that we add a
> statistic for each column that records the maximum number of instances we
> need for each vector row group of 1024 rows.
>    Having that number would let you set the vector row batch for the complex
> types as you are starting each stripe as well as being able to predict how
> much memory the reader will need.
> Thoughts?
>     Owen
