orc-dev mailing list archives

From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: "For dictionary encodings the dictionary is sorted"
Date Tue, 06 Jun 2017 22:39:41 GMT
I'm confused. TimestampStatistics uses integers, not strings.

.. Owen

On Mon, Jun 5, 2017 at 9:53 PM, Dain Sundstrom <dain@iq80.com> wrote:

>
> > On Dec 12, 2016, at 4:48 PM, Dain Sundstrom <dain@iq80.com> wrote:
> > On Dec 12, 2016, at 4:36 PM, Owen O'Malley <omalley@apache.org> wrote:
> >>> I think this should also be documented in the statistics section, which
> >>> also uses UTF-16 BE, which is at least consistent, but still annoying
> >>> for everything other than Java.
> >>
> >> Yes, it should be documented and we should replace it with UTF-8.
> >> (Although changes to the serialized form are always painful.)
> >
> > I think we can do something similar to the bloom filter code, where we
> > add a StringUtf8Stats object and have a transition period where we can
> > produce both.
>
> I was looking at the proto changes to TimestampStatistics, and I
> think the same thing could work here.  We add:
>
>     optional string minimumUtf8 = 4;
>     optional string maximumUtf8 = 5;
>
> and then update the writer to write just the UTF-8 version (or both during a
> transition).
>
> -dain
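
For readers following the thread, here is a minimal sketch of why the encoding behind a string min/max matters. It assumes the concern is that ordering by UTF-16 BE code units and ordering by UTF-8 bytes can pick different extremes. The helper name min_max and the sample values are hypothetical; this is not ORC writer code.

```python
def min_max(values, encoding):
    """Return (min, max) of values, ordered by their encoded bytes.

    Hypothetical helper for illustration only; not part of ORC.
    """
    def key(s):
        return s.encode(encoding)
    return min(values, key=key), max(values, key=key)

# U+0061, U+FF5E (a BMP character above the surrogate range),
# and U+1F600 (a supplementary character, stored as a surrogate pair in UTF-16).
values = ["a", "\uff5e", "\U0001f600"]

utf16 = min_max(values, "utf-16-be")  # surrogates (D800..DFFF) sort below U+FF5E
utf8 = min_max(values, "utf-8")       # 4-byte sequences sort above 3-byte ones

print("UTF-16 BE max:", hex(ord(utf16[1])))
print("UTF-8 max:    ", hex(ord(utf8[1])))
```

With these inputs the two orderings agree on the minimum but not on the maximum, which is the kind of mismatch a reader comparing in UTF-8 byte order would hit against statistics computed in UTF-16 order.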
