orc-user mailing list archives

From Prasanth J <j.prasant...@gmail.com>
Subject Re: ORC Indexing
Date Thu, 16 Jul 2015 16:16:13 GMT
A bloom filter index was recently added to ORC. It is much more accurate at row group
elimination than the min/max based index.
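
To illustrate the difference, here is a minimal Python sketch. It is not ORC's actual
implementation: the names (BloomFilter, build_index, groups_to_read_minmax,
groups_to_read_bloom), the hashing scheme, and the tiny row group size are all made up
for illustration. It only shows why a per-row-group bloom filter can rule out row groups
that a min/max range check has to keep when the column is not sorted.

```python
import hashlib

ROW_GROUP_SIZE = 4  # illustrative only; the thread mentions 10,000 rows as the default


class BloomFilter:
    """Toy bloom filter; bit count and hash scheme are assumptions for this sketch."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_index(values, group_size=ROW_GROUP_SIZE):
    """Build one index entry (min, max, bloom filter) per row group."""
    index = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        bloom = BloomFilter()
        for v in group:
            bloom.add(v)
        index.append({"start": start, "min": min(group), "max": max(group), "bloom": bloom})
    return index


def groups_to_read_minmax(index, target):
    """Row groups that a min/max check cannot eliminate for `col = target`."""
    return [e["start"] for e in index if e["min"] <= target <= e["max"]]


def groups_to_read_bloom(index, target):
    """Row groups that survive both the min/max check and the bloom filter."""
    return [
        e["start"]
        for e in index
        if e["min"] <= target <= e["max"] and e["bloom"].might_contain(target)
    ]


# Unsorted column: every row group's [min, max] range covers 42.
values = [1, 90, 5, 70, 2, 95, 8, 60, 42, 3, 99, 10]
index = build_index(values)
print(groups_to_read_minmax(index, 42))  # min/max cannot skip any row group here
print(groups_to_read_bloom(index, 42))   # always includes the group starting at row 8
```

With these values the min/max check has to keep all three row groups, while the bloom
filter check is expected to keep only the group that actually holds 42, barring a rare
false positive. A bloom filter never produces a false negative, so a row group containing
the value is never skipped.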

Thanks
Prasanth

> On Jul 16, 2015, at 9:07 AM, Thomas Abeler <thomas@sensenetworks.com> wrote:
> 
> Hey,
> 
> I have a question about how indexing in ORC works.
> 
> The way I understand ORC indexing, ORC keeps statistics (min, max, sum) for every
> 10,000 rows (by default), and when I query the data it looks at those statistics
> to figure out whether it needs to read that chunk of rows or not.
> 
> If that's true, is it possible to build an index on an ORC file that is more similar
> to a database index? That is, I want to create another sorted data structure which holds
> the field value and a pointer to the record it relates to.
> 
> The problem I have is that I have a huge dataset: more than 300 TB and 69 columns. There
> is no 'key' column that gets frequently queried, and I would like to perform ad-hoc queries
> on nearly every one of these columns. I think building an index on every column would be a
> good approach to get this ability.
> 
> Regards,
> Thomas

