orc-dev mailing list archives

From Dain Sundstrom <d...@iq80.com>
Subject Re: Bloom filter hash broken
Date Thu, 08 Sep 2016 18:02:50 GMT
> On Sep 8, 2016, at 9:59 AM, Owen O'Malley <omalley@apache.org> wrote:
> 
> Ok, Prasanth found a problem with my proposed approach. In particular, the
> old readers would misinterpret bloom filters from new files. Therefore, I'd
> like to propose a more complicated solution:
> 1. We extend the stripe footer or bloom filter index to record the default
> encoding when we are writing a string or decimal bloom filter.
> 2. When reading a bloom filter, we use the encoding if it is present.

Does that mean that you always write with the platform encoding?  This would make using bloom
filters for reads in other programming languages difficult, because you would need to convert
from UTF-8 to some arbitrary character encoding.  It would also make using these bloom filters
in performance-critical sections (join loops) computationally expensive, since every probe has
to do a transcode.
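
To make that cost concrete, here is a rough sketch in Java of what a reader that holds UTF-8
key bytes would have to do on every probe if the filter were hashed with the writer's platform
charset.  The BloomFilterSketch interface and probe() are just stand-ins, not the real ORC API:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    final class TranscodingProbe {
        // Hypothetical stand-in for whatever the real bloom filter test call is.
        interface BloomFilterSketch {
            boolean probe(byte[] serializedKey);
        }

        static boolean mightContain(BloomFilterSketch filter,
                                    byte[] utf8Key,
                                    Charset writerCharset) {
            // Decode from UTF-8 and re-encode in the writer's charset just so
            // the hash input matches what the writer hashed.
            byte[] writerBytes = new String(utf8Key, StandardCharsets.UTF_8)
                                     .getBytes(writerCharset);
            return filter.probe(writerBytes);
        }
    }

That decode/re-encode pair allocates on every key, which is exactly what you do not want
inside a join loop.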

Also, I think the spec needs to be clarified.  The spec does not state the character encoding
of the bloom filters; I assumed it was UTF-8 to match the normal string column encoding.
It also looks like the spec does not document the meaning of "the version of the writer" or
what workarounds are necessary (or what operating assumptions have been made).  Once we have
that, we should document that old readers assume the platform default charset is the same
for readers and writers.
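
For what it is worth, the underlying problem is easy to show.  This is not the actual writer
code, just an illustration of why hashing String.getBytes() (platform default) instead of
explicit UTF-8 bytes breaks the filter for any reader that does not share the writer's
file.encoding:

    import java.nio.charset.StandardCharsets;

    final class CharsetMismatchDemo {
        public static void main(String[] args) {
            String value = "café";

            // e.g. 0x63 0x61 0x66 0xE9 when the platform default is ISO-8859-1
            byte[] platformBytes = value.getBytes();
            // always 0x63 0x61 0x66 0xC3 0xA9
            byte[] utf8Bytes = value.getBytes(StandardCharsets.UTF_8);

            // A reader that hashes the UTF-8 bytes checks different bloom filter
            // bits than a writer that hashed the platform-default bytes, so the
            // filter silently produces false negatives.
            System.out.println(platformBytes.length + " vs " + utf8Bytes.length);
        }
    }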

As an alternative, for new files we could add a new stream ID, so that old readers skip
them.
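
The reason that is safe, as I understand it, is that readers walk the stream list in the
stripe footer, advance their offset by every stream's length, and only decode the kinds they
know about.  A rough sketch (NEW_BLOOM_FILTER is a made-up kind, and the types here are
simplified, not the real ORC classes):

    import java.util.List;

    final class StreamSkipSketch {
        enum Kind { ROW_INDEX, BLOOM_FILTER, NEW_BLOOM_FILTER }  // last one hypothetical

        static final class StreamInfo {
            final Kind kind;
            final long length;
            StreamInfo(Kind kind, long length) { this.kind = kind; this.length = length; }
        }

        static void readIndexStreams(List<StreamInfo> streams) {
            long offset = 0;
            for (StreamInfo s : streams) {
                if (s.kind == Kind.ROW_INDEX || s.kind == Kind.BLOOM_FILTER) {
                    decode(s, offset);      // old readers only decode kinds they know
                }
                offset += s.length;         // unknown kinds still advance the offset
            }
        }

        private static void decode(StreamInfo s, long offset) {
            // placeholder for actual index/bloom filter decoding
        }
    }

So old readers would simply never look at the new stream, while new readers get a filter with
a well-defined encoding.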

-dain