orc-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Owen O'Malley" <omal...@apache.org>
Subject Re: Bloom filter hash broken
Date Fri, 09 Sep 2016 17:48:08 GMT
One more wrinkle on this that I just discovered is that I inadvertently
fixed the problem for strings (but not decimals) as part of HIVE-12055,
which was released as part of Hive 2.1. So that means we currently have two
versions out there:

WriterVersion      String charset  Decimal charset
< HIVE-12055      jvm default      jvm default
>= HIVE-12055    UTF-8             jvm default

So I'm going ahead with the BLOOM_FILTER_UTF8 stream (and optionally the
old BLOOM_FILTER stream), but I'll make the reader recognize the
WriterVersion >= HIVE-12055 and trust BLOOM_FILTER for string bloom
filters, but not the decimal ones.

.. Owen


On Thu, Sep 8, 2016 at 7:25 PM, Dain Sundstrom <dain@iq80.com> wrote:

> Sounds good to me.
>
> Should we add a version field to the BLOOM_FILTER_UTF8 to deal with any
> future problems?
>
> One other thought, in the protobuf definition I think it would be more
> efficient to have the bitset encoded as a byte[] to avoid the boxed long
> array.
>
> -dain
>
> > On Sep 8, 2016, at 3:42 PM, Owen O'Malley <omalley@apache.org> wrote:
> >
> > Dain,
> >   That is a great point. I wasn't thinking about having to implement that
> > in C++, where it would really suck. (It was hard enough dealing with the
> > timezones in C++, so I should know better!) I got the approach from this
> > morning working and pushed it as a branch
> > https://github.com/omalley/orc/commit/38752621863bf8dc1f05a6e7d34552
> 969395e5f5
> > .
> >
> >  The heaviest hammer would be to just create a new file version, but that
> > seems like overkill for this. (Although sooner or later we will get
> there,
> > since we need new encodings for decimal.)
> >
> >  Ok, so how about:
> >
> > 1. We create a new stream kind (BLOOM_FILTER_UTF8) that always has UTF-8
> > based bloom filters. It will be used for strings, chars, varchars, and
> > decimal. (Hopefully all charsets use the same bytes for ascii characters,
> > but I don't want to find the strange exceptions.)
> >
> > 2. We create a new config knob that let's you write both BLOOM_FILTER and
> > BLOOM_FILTER_UTF8 for users while they are transitioning.
> >
> > 3. The reader prefers BLOOM_FILTER_UTF8, but will fall back to
> BLOOM_FILTER
> > if it is an old file.
> >
> > Thoughts?
> >
> > .. Owen
> >
> > On Thu, Sep 8, 2016 at 11:02 AM, Dain Sundstrom <dain@iq80.com> wrote:
> >
> >>> On Sep 8, 2016, at 9:59 AM, Owen O'Malley <omalley@apache.org> wrote:
> >>>
> >>> Ok, Prasanth found a problem with my proposed approach. In particular,
> >> the
> >>> old readers would misinterpret bloom filters from new files. Therefore,
> >> I'd
> >>> like to propose a more complicated solution:
> >>> 1. We extend the stripe footer or bloom filter index to record the
> >> default
> >>> encoding when we are writing a string or decimal bloom filter.
> >>> 2. When reading a bloom filter, we use the encoding if it is present.
> >>
> >> Does that mean that you always write with he platform encoding?  This
> >> would make using bloom filters for read in other programming languages
> >> difficult because you would need to do a UTF_8 to some arbitrary
> character
> >> encoding.  This will also make using these bloom filters in performance
> >> critical sections (join loops) computationally expensive as you have to
> do
> >> a transcode.
> >>
> >> Also, I think the spec need to be clarified.  The spec does not state
> the
> >> character encoding of the bloom filters.  I assumed it was UTF_8 to
> match
> >> the normal string column encoding.  It looks like the spec does not
> >> document the meaning of "the version of the writer” and what workarounds
> >> are necessary (or operating assumptions have been made).  Once we have
> >> that, we should document that old readers assume that the platform
> default
> >> charset is consistent for readers and writers.
> >>
> >> As and alternative, for new files we could add add a new stream ID, so
> the
> >> old readers skip them.
> >>
> >> -dain
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message