orc-dev mailing list archives

From Alan Gates <alanfga...@gmail.com>
Subject Re: Bloom filter hash broken
Date Wed, 07 Sep 2016 18:12:51 GMT
I think using the default encoding for the old files is the best option, as it will be right
99% of the time.  I was wondering how the system would know whether or not this was an old
file.

Alan.

> On Sep 7, 2016, at 10:06, Owen O'Malley <omalley@apache.org> wrote:
> 
> 4 is about when you are using the bloom filter for predicate push down. I'm
> saying old files should use the default encoding when checking the bloom
> filter. The other option is to always have the predicate push down say
> maybe if the file is an old one.
> 
> .. Owen
> 
> On Wed, Sep 7, 2016 at 9:34 AM, Alan Gates <alanfgates@gmail.com> wrote:
> 
>> +1 to 1-3.  On 4, what do you mean by test?  Assume it’s the default
>> encoding and use that?  Is there a versioning concept in the bloom filters
>> that will make it easy to determine if this is pre or post ORC-101?
>> 
>> Alan.
>> 
>>> On Sep 7, 2016, at 08:57, Owen O'Malley <omalley@apache.org> wrote:
>>> 
>>> All,
>>>  Dain Sundstrom pointed out to me in personal email that the ORC bloom
>>> filters are currently using the default character encoding. That makes the
>>> bloom filters non-portable between different computers that use different
>>> default encodings. I've filed ORC-101 to address it, but I want to have a
>>> wider discussion. I'd propose that we:
>>> 
>>> 1. create a new WriterVersion for ORC-101.
>>> 2. move the bloom filter code from storage-api into ORC.
>>> 3. consistently use UTF-8 when creating new bloom filters.
>>> 4. for ORC files older than ORC-101, test the default encoding instead of
>>> UTF-8.
>>> 
>>> Thoughts?
>>> 
>>> .. Owen
>> 
>> 
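
To illustrate the portability problem discussed in this thread, here is a minimal Java sketch. It assumes the bloom filter hashes the encoded bytes of a string, so a different encoding gives a different hash input. The class BloomCharsetSketch, the methods keyBytes and charsetForFile, and the boolean writtenAfterOrc101 flag are all hypothetical names made up for illustration; they are not ORC's API, and how a reader would actually detect an old file (for example via the proposed WriterVersion) is left open here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BloomCharsetSketch {
    // Hypothetical stand-in for the hash input: the string's bytes in a given charset.
    public static byte[] keyBytes(String value, Charset charset) {
        return value.getBytes(charset);
    }

    // Hypothetical reader-side rule for item 4: new files use UTF-8,
    // old files fall back to the reading JVM's default encoding.
    public static Charset charsetForFile(boolean writtenAfterOrc101) {
        return writtenAfterOrc101 ? StandardCharsets.UTF_8 : Charset.defaultCharset();
    }

    public static void main(String[] args) {
        String nonAscii = "caf\u00e9"; // contains one non-ASCII character
        byte[] utf8 = keyBytes(nonAscii, StandardCharsets.UTF_8);
        byte[] latin1 = keyBytes(nonAscii, StandardCharsets.ISO_8859_1);

        // The two encodings disagree on the non-ASCII character, so a filter
        // written under one encoding could miss when probed under the other.
        System.out.println("UTF-8:      " + Arrays.toString(utf8));
        System.out.println("ISO-8859-1: " + Arrays.toString(latin1));
        System.out.println("same bytes: " + Arrays.equals(utf8, latin1)); // false

        // Pure ASCII strings encode identically under both charsets.
        System.out.println("ASCII same: " + Arrays.equals(
                keyBytes("cafe", StandardCharsets.UTF_8),
                keyBytes("cafe", StandardCharsets.ISO_8859_1))); // true

        System.out.println("old file charset: " + charsetForFile(false));
        System.out.println("new file charset: " + charsetForFile(true));
    }
}
```

Under this sketch, the alternative Owen mentions (always answering "maybe" for old files) would amount to skipping the bloom filter check entirely when writtenAfterOrc101 is false, rather than guessing the writer's encoding.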

