Mailing-List: contact dev-help@orc.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@orc.apache.org
MIME-Version: 1.0
In-Reply-To: <2A0CB33A-E109-4B3E-8518-3F6727E61532@iq80.com>
References: <CAHfHakHsVPF44+jJP1+XyVghAnQMcOPi_EMTXA-QSY6z6sQkcA@mail.gmail.com>
 <0367F280-9955-49AF-8E8C-EB4623B64341@gmail.com> <CAHfHakEgqOgMSMJBX5VBx_GNosa3C=WeV+zUUDOxPa4xk7PYLg@mail.gmail.com>
 <582F71B9-2F58-45D4-9376-36FE85D96720@gmail.com> <222D48259388A7CE.9E77F8F2-AA9B-4168-931F-444F9A4F148D@mail.outlook.com>
 <CAHfHakGR7vLAjRf8hqLtSFqR_GgLiH+FLHNmz=xsvWnRdDZe1g@mail.gmail.com>
 <CAHfHakHeig5r6w8NP0ghQdEM3SJ6n9Bro9F0s_XOmRUuSyXq8Q@mail.gmail.com> <2A0CB33A-E109-4B3E-8518-3F6727E61532@iq80.com>
From: "Owen O'Malley" <omalley@apache.org>
Date: Thu, 8 Sep 2016 15:42:57 -0700
Message-ID: <CAHfHakGF18To5GP3y5h1d1kJ2OQvE8GoZB6iXwhQyt19O0k-Sg@mail.gmail.com>
Subject: Re: Bloom filter hash broken
To: dev@orc.apache.org
Content-Type: multipart/alternative; boundary=001a114d7d94571882053c06c094
archived-at: Thu, 08 Sep 2016 22:43:00 -0000

--001a114d7d94571882053c06c094
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Dain,
   That is a great point. I wasn't thinking about having to implement that
in C++, where it would really suck. (It was hard enough dealing with the
timezones in C++, so I should know better!) I got the approach from this
morning working and pushed it as a branch
https://github.com/omalley/orc/commit/38752621863bf8dc1f05a6e7d34552969395e5f5

  The heaviest hammer would be to just create a new file version, but that
seems like overkill for this. (Although sooner or later we will get there,
since we need new encodings for decimal.)

  Ok, so how about:

1. We create a new stream kind (BLOOM_FILTER_UTF8) that always has UTF-8
based bloom filters. It will be used for strings, chars, varchars, and
decimal. (Hopefully all charsets use the same bytes for ASCII characters,
but I don't want to find the strange exceptions.)

2. We create a new config knob that lets you write both BLOOM_FILTER and
BLOOM_FILTER_UTF8 for users while they are transitioning.

3. The reader prefers BLOOM_FILTER_UTF8, but will fall back to BLOOM_FILTER
if it is an old file.
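
A rough sketch of steps 1-3 in Python, purely to illustrate the idea. The
function names and the boolean knob are hypothetical and are not ORC APIs;
only the two stream kind names come from the proposal above. The last few
lines show why the legacy stream is ambiguous: the same string yields
different bytes (and so different bloom filter hashes) under different
charsets, while plain ASCII agrees in the two charsets shown.

```python
from enum import Enum


class StreamKind(Enum):
    # Legacy stream: string bytes came from the writer's platform charset.
    BLOOM_FILTER = "BLOOM_FILTER"
    # Proposed stream: string bytes are always UTF-8.
    BLOOM_FILTER_UTF8 = "BLOOM_FILTER_UTF8"


def streams_to_write(write_legacy_too):
    """Step 2 (hypothetical knob): also emit the legacy stream while transitioning."""
    kinds = [StreamKind.BLOOM_FILTER_UTF8]
    if write_legacy_too:
        kinds.append(StreamKind.BLOOM_FILTER)
    return kinds


def pick_bloom_stream(available):
    """Step 3: prefer the UTF-8 stream, fall back to the legacy one, else None."""
    for kind in (StreamKind.BLOOM_FILTER_UTF8, StreamKind.BLOOM_FILTER):
        if kind in available:
            return kind
    return None


# The same value encodes to different bytes depending on the charset, so a
# filter built from one encoding cannot be probed reliably with another.
value = "caf\u00e9"
utf8_bytes = value.encode("utf-8")
latin1_bytes = value.encode("latin-1")

# Pure ASCII values encode identically in these two charsets.
ascii_utf8 = "orc".encode("utf-8")
ascii_latin1 = "orc".encode("latin-1")
```

An old file only has BLOOM_FILTER, so pick_bloom_stream returns that; a
file written with the knob on has both, and the UTF-8 one wins.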

Thoughts?

.. Owen

On Thu, Sep 8, 2016 at 11:02 AM, Dain Sundstrom <dain@iq80.com> wrote:

> > On Sep 8, 2016, at 9:59 AM, Owen O'Malley <omalley@apache.org> wrote:
> >
> > Ok, Prasanth found a problem with my proposed approach. In particular,
> > the old readers would misinterpret bloom filters from new files.
> > Therefore, I'd like to propose a more complicated solution:
> > 1. We extend the stripe footer or bloom filter index to record the
> > default encoding when we are writing a string or decimal bloom filter.
> > 2. When reading a bloom filter, we use the encoding if it is present.
>
> Does that mean that you always write with the platform encoding?  This
> would make using bloom filters for reads in other programming languages
> difficult because you would need to transcode from UTF_8 to some arbitrary
> character encoding.  This will also make using these bloom filters in
> performance-critical sections (join loops) computationally expensive, as
> you have to do a transcode.
>
> Also, I think the spec needs to be clarified.  The spec does not state the
> character encoding of the bloom filters.  I assumed it was UTF_8 to match
> the normal string column encoding.  It looks like the spec does not
> document the meaning of "the version of the writer" and what workarounds
> are necessary (or operating assumptions have been made).  Once we have
> that, we should document that old readers assume that the platform default
> charset is consistent for readers and writers.
>
> As an alternative, for new files we could add a new stream ID, so the
> old readers skip them.
>
> -dain

--001a114d7d94571882053c06c094--