From: "Owen O'Malley"
Date: Thu, 8 Sep 2016 15:42:57 -0700
Subject: Re: Bloom filter hash broken
To: dev@orc.apache.org

Dain,
   That is a great point. I wasn't thinking about having to implement that
in C++, where it would really suck. (It was hard enough dealing with the
timezones in C++, so I should know better!)

I got the approach from this morning working and pushed it as a branch
https://github.com/omalley/orc/commit/38752621863bf8dc1f05a6e7d34552969395e5f5
.

The heaviest hammer would be to just create a new file version, but that
seems like overkill for this. (Although sooner or later we will get there,
since we need new encodings for decimal.)

Ok, so how about:

1. We create a new stream kind (BLOOM_FILTER_UTF8) that always has UTF-8
based bloom filters. It will be used for strings, chars, varchars, and
decimal. (Hopefully all charsets use the same bytes for ASCII characters,
but I don't want to find the strange exceptions.)
2. We create a new config knob that lets you write both BLOOM_FILTER and
BLOOM_FILTER_UTF8 for users while they are transitioning.
3. The reader prefers BLOOM_FILTER_UTF8, but will fall back to BLOOM_FILTER
if it is an old file.

Thoughts?

.. Owen

On Thu, Sep 8, 2016 at 11:02 AM, Dain Sundstrom wrote:

> > On Sep 8, 2016, at 9:59 AM, Owen O'Malley wrote:
> >
> > Ok, Prasanth found a problem with my proposed approach. In particular, the
> > old readers would misinterpret bloom filters from new files. Therefore, I'd
> > like to propose a more complicated solution:
> > 1. We extend the stripe footer or bloom filter index to record the default
> > encoding when we are writing a string or decimal bloom filter.
> > 2. When reading a bloom filter, we use the encoding if it is present.
>
> Does that mean that you always write with the platform encoding? This
> would make using bloom filters for reads in other programming languages
> difficult, because you would need to transcode from UTF_8 to some arbitrary
> character encoding. This will also make using these bloom filters in
> performance-critical sections (join loops) computationally expensive, as you
> have to do a transcode.
>
> Also, I think the spec needs to be clarified. The spec does not state the
> character encoding of the bloom filters. I assumed it was UTF_8 to match
> the normal string column encoding. It looks like the spec does not
> document the meaning of "the version of the writer" and what workarounds
> are necessary (or operating assumptions have been made). Once we have
> that, we should document that old readers assume that the platform default
> charset is consistent for readers and writers.
>
> As an alternative, for new files we could add a new stream ID, so the
> old readers skip them.
>
> -dain
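
A minimal Python sketch of the two ideas discussed in this thread, under stated assumptions. It is not the ORC implementation: the string constants only mirror the stream kind names from the proposal, and key_bytes and pick_bloom_stream are hypothetical helpers introduced for illustration. The first helper shows why platform-default encodings are a problem (non-ASCII text yields different bytes, so the bloom filter hash input differs), and the second shows the proposed reader preference with fallback.

```python
# Hypothetical names mirroring the proposal; not the ORC library API.
BLOOM_FILTER = "BLOOM_FILTER"
BLOOM_FILTER_UTF8 = "BLOOM_FILTER_UTF8"


def key_bytes(value: str, charset: str) -> bytes:
    """Bytes that would be fed to the bloom filter hash for a string value."""
    return value.encode(charset)


def pick_bloom_stream(stream_kinds):
    """Prefer the UTF-8 stream, fall back to the legacy one, else None."""
    for kind in (BLOOM_FILTER_UTF8, BLOOM_FILTER):
        if kind in stream_kinds:
            return kind
    return None


# Non-ASCII text encodes to different bytes under different charsets, so a
# writer and a reader with different platform defaults hash different input.
assert key_bytes("caf\u00e9", "utf-8") != key_bytes("caf\u00e9", "latin-1")

# Pure ASCII text encodes identically in these two charsets.
assert key_bytes("cafe", "utf-8") == key_bytes("cafe", "latin-1")

# A file written during the transition carries both streams; an old file
# carries only the legacy one.
transition_file = ["ROW_INDEX", BLOOM_FILTER, BLOOM_FILTER_UTF8]
old_file = ["ROW_INDEX", BLOOM_FILTER]
print(pick_bloom_stream(transition_file))
print(pick_bloom_stream(old_file))
```

When only the legacy stream is present, the reader in this sketch still has to assume the writer's platform charset matched its own, which is the operating assumption Dain suggests documenting.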