orc-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dain Sundstrom <d...@iq80.com>
Subject Re: Bloom filter hash broken
Date Fri, 09 Sep 2016 02:25:15 GMT
Sounds good to me.

Should we add a version field to the BLOOM_FILTER_UTF8 to deal with any future problems?

One other thought, in the protobuf definition I think it would be more efficient to have the
bitset encoded as a byte[] to avoid the boxed long array.

> On Sep 8, 2016, at 3:42 PM, Owen O'Malley <omalley@apache.org> wrote:
> Dain,
>   That is a great point. I wasn't thinking about having to implement that
> in C++, where it would really suck. (It was hard enough dealing with the
> timezones in C++, so I should know better!) I got the approach from this
> morning working and pushed it as a branch
> https://github.com/omalley/orc/commit/38752621863bf8dc1f05a6e7d34552969395e5f5
> .
>  The heaviest hammer would be to just create a new file version, but that
> seems like overkill for this. (Although sooner or later we will get there,
> since we need new encodings for decimal.)
>  Ok, so how about:
> 1. We create a new stream kind (BLOOM_FILTER_UTF8) that always has UTF-8
> based bloom filters. It will be used for strings, chars, varchars, and
> decimal. (Hopefully all charsets use the same bytes for ascii characters,
> but I don't want to find the strange exceptions.)
> 2. We create a new config knob that let's you write both BLOOM_FILTER and
> BLOOM_FILTER_UTF8 for users while they are transitioning.
> 3. The reader prefers BLOOM_FILTER_UTF8, but will fall back to BLOOM_FILTER
> if it is an old file.
> Thoughts?
> .. Owen
> On Thu, Sep 8, 2016 at 11:02 AM, Dain Sundstrom <dain@iq80.com> wrote:
>>> On Sep 8, 2016, at 9:59 AM, Owen O'Malley <omalley@apache.org> wrote:
>>> Ok, Prasanth found a problem with my proposed approach. In particular,
>> the
>>> old readers would misinterpret bloom filters from new files. Therefore,
>> I'd
>>> like to propose a more complicated solution:
>>> 1. We extend the stripe footer or bloom filter index to record the
>> default
>>> encoding when we are writing a string or decimal bloom filter.
>>> 2. When reading a bloom filter, we use the encoding if it is present.
>> Does that mean that you always write with he platform encoding?  This
>> would make using bloom filters for read in other programming languages
>> difficult because you would need to do a UTF_8 to some arbitrary character
>> encoding.  This will also make using these bloom filters in performance
>> critical sections (join loops) computationally expensive as you have to do
>> a transcode.
>> Also, I think the spec need to be clarified.  The spec does not state the
>> character encoding of the bloom filters.  I assumed it was UTF_8 to match
>> the normal string column encoding.  It looks like the spec does not
>> document the meaning of "the version of the writer” and what workarounds
>> are necessary (or operating assumptions have been made).  Once we have
>> that, we should document that old readers assume that the platform default
>> charset is consistent for readers and writers.
>> As and alternative, for new files we could add add a new stream ID, so the
>> old readers skip them.
>> -dain

View raw message