orc-dev mailing list archives

From "Owen O'Malley" <omal...@apache.org>
Subject Re: Bloom filter hash broken
Date Wed, 07 Sep 2016 20:08:45 GMT
To expand on Prasanth's answer, in ORC we have both a format version, which
is the oldest version of the reader that can read the file (e.g. 0.11 and
0.12), and a writer version, which keeps track of which version of the
software wrote the file. Writer versions are denoted by the jiras where there
were significant changes in the writer (e.g. original, hive-8732, hive-4243,
hive-12055, hive-13083, and now orc-101). The reader uses the writer version
to work around issues like this.
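
As a rough illustration of how a reader could use the writer version to work around ORC-101, here is a minimal Python sketch (ORC itself is Java; Python is used only for brevity). The enum member names follow the list above, but the numeric values, the function name `bloom_filter_charset`, and the `platform_default` parameter are assumptions made for this example, not ORC's actual API.

```python
from enum import IntEnum


class WriterVersion(IntEnum):
    # Member names mirror the jiras listed in the message.
    # The numeric values are illustrative assumptions; only their order matters here.
    ORIGINAL = 0
    HIVE_8732 = 1
    HIVE_4243 = 2
    HIVE_12055 = 3
    HIVE_13083 = 4
    ORC_101 = 5


def bloom_filter_charset(writer_version, platform_default):
    """Pick the charset used to encode strings before probing a bloom filter.

    Hypothetical helper: files written at ORC_101 or later always used UTF-8,
    while older files used whatever the writing machine's default was. The
    reader cannot know that default, so it falls back to its own.
    """
    if writer_version >= WriterVersion.ORC_101:
        return "utf-8"
    return platform_default


# A reader on a Latin-1 machine looking at an old file and a new file.
old_file_charset = bloom_filter_charset(WriterVersion.HIVE_13083, "latin-1")
new_file_charset = bloom_filter_charset(WriterVersion.ORC_101, "latin-1")
```

The fallback for old files is only a best guess, since the reader's default may differ from the writer's. The more conservative alternative discussed in the thread is to skip the bloom filter for old files and treat every predicate as "maybe".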

.. Owen

On Wed, Sep 7, 2016 at 12:08 PM, Prasanth Jayachandran <
j.prasanth.j@gmail.com> wrote:

> +1 to bumping the writer version to facilitate correct PPD for files
> written by older versions.
> Alan - PPD will have to look at the writer version to detect old files.
> Newer files will have ORC-101 as their writer version.
>
> Thanks
> Prasanth
>
> On Wed, Sep 7, 2016 at 1:12 PM -0500, "Alan Gates" <alanfgates@gmail.com>
> wrote:
>
> I think using the default encoding for the old files is the best option,
> as it will be right 99% of the time.  I was wondering how the system would
> know whether or not this was an old file.
>
> Alan.
>
> > On Sep 7, 2016, at 10:06, Owen O'Malley wrote:
> >
> > 4 is about when you are using the bloom filter for predicate push down. I'm
> > saying old files should use the default encoding when checking the bloom
> > filter. The other option is to always have the predicate push down say
> > maybe if the file is an old one.
> >
> > .. Owen
> >
> > On Wed, Sep 7, 2016 at 9:34 AM, Alan Gates wrote:
> >
> >> +1 to 1-3.  On 4, what do you mean by test?  Assume it’s the default
> >> encoding and use that?  Is there a versioning concept in the bloom filters
> >> that will make it easy to determine if this is pre or post ORC-101?
> >>
> >> Alan.
> >>
> >>> On Sep 7, 2016, at 08:57, Owen O'Malley wrote:
> >>>
> >>> All,
> >>>  Dain Sundstrom pointed out to me in personal email that the ORC bloom
> >>> filters are currently using the default character encoding. That makes the
> >>> bloom filters non-portable between different computers that use different
> >>> default encodings. I've filed ORC-101 to address it, but I want to have a
> >>> wider discussion. I'd propose that we:
> >>>
> >>> 1. create a new WriterVersion for ORC-101.
> >>> 2. move the bloom filter code from storage-api into ORC.
> >>> 3. consistently use UTF-8 when creating new bloom filters
> >>> 4. for ORC files older than ORC-101, test the default encoding instead of
> >>> UTF-8
> >>>
> >>> Thoughts?
> >>>
> >>> .. Owen
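
To make the portability problem from the original message concrete, the following Python sketch shows the same string producing different bytes under two encodings, so a filter built from one set of bytes is unlikely to match a probe using the other. `TinyBloomFilter` is a toy written for this example (SHA-256 based, with arbitrary sizes); it is not ORC's bloom filter or its hash function, and Latin-1 simply stands in for some non-UTF-8 platform default.

```python
import hashlib


class TinyBloomFilter:
    """Toy bloom filter for illustration only; not ORC's implementation."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a single Python int

    def _positions(self, data):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, data):
        for pos in self._positions(data):
            self.bits |= 1 << pos

    def might_contain(self, data):
        return all((self.bits >> pos) & 1 for pos in self._positions(data))


value = "café"
writer_bytes = value.encode("latin-1")  # writer whose default charset was Latin-1
reader_bytes = value.encode("utf-8")    # reader that encodes as UTF-8

bloom = TinyBloomFilter()
bloom.add(writer_bytes)

# Probing with the writer's bytes always hits. Probing with reader_bytes
# hashes different input, so a miss is the overwhelmingly likely outcome,
# and predicate push down would then wrongly skip rows that contain the value.
found_with_writer_encoding = bloom.might_contain(writer_bytes)
found_with_reader_encoding = bloom.might_contain(reader_bytes)

# Pure ASCII strings encode identically in both charsets, which is why the
# problem goes unnoticed for most data.
ascii_is_portable = "cafe".encode("latin-1") == "cafe".encode("utf-8")
```

This also suggests why assuming the default encoding for old files is right most of the time: values made only of ASCII characters are unaffected, and only non-ASCII values written on a machine with a different default are at risk.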
