orc-dev mailing list archives

From Dain Sundstrom <d...@iq80.com>
Subject Re: "For dictionary encodings the dictionary is sorted"
Date Tue, 06 Jun 2017 04:53:08 GMT

> On Dec 12, 2016, at 4:48 PM, Dain Sundstrom <dain@iq80.com> wrote:
>> On Dec 12, 2016, at 4:36 PM, Owen O'Malley <omalley@apache.org> wrote:
>>> I think this should also be documented in the statistics section which
>>> also uses UTF-16 BE, which is at least consistent, but still annoying for
>>> everything other than Java.
>> Yes, it should be documented and we should replace it with UTF-8. (Although
>> changes to the serialized form are always painful.)
> I think we can do something similar to the bloom filter code, where we add a StringUtf8Stats
> object and have a transition period where we can produce both.

I was looking at the proto changes to TimestampStatistics, and I think the same thing
could work here.  We add:

    optional string minimumUtf8 = 4;
    optional string maximumUtf8 = 5;

and then update the writer to write just the UTF-8 version (or both during a transition).
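
To make the difference concrete, here is a minimal Python sketch (not ORC code) of why the two orderings can disagree and how a reader might cope during the transition. It assumes the existing fields are ordered by UTF-16 code units, as discussed above, and that the new fields would be ordered by UTF-8 bytes. The dict keys mirror the field names in the proposal; string_stats and read_bounds are hypothetical helpers made up for illustration.

```python
def utf16_key(s: str) -> bytes:
    # Big-endian UTF-16 bytes compare in the same order as UTF-16 code units.
    return s.encode("utf-16-be")


def utf8_key(s: str) -> bytes:
    # UTF-8 bytes compare in Unicode code point order.
    return s.encode("utf-8")


def string_stats(values):
    """Hypothetical writer side: produce both orderings during a transition."""
    return {
        "minimum": min(values, key=utf16_key),
        "maximum": max(values, key=utf16_key),
        "minimumUtf8": min(values, key=utf8_key),
        "maximumUtf8": max(values, key=utf8_key),
    }


def read_bounds(stats):
    """Hypothetical reader side: prefer the UTF-8 fields, else fall back."""
    if "minimumUtf8" in stats and "maximumUtf8" in stats:
        return stats["minimumUtf8"], stats["maximumUtf8"], "utf8"
    return stats["minimum"], stats["maximum"], "utf16"


# U+FF5E is a single UTF-16 code unit (0xFF5E); U+1F600 is encoded as a
# surrogate pair starting with 0xD83D, so it sorts *below* U+FF5E in UTF-16
# order but *above* it in UTF-8 order.
values = ["a", "\uFF5E", "\U0001F600"]
stats = string_stats(values)
```

With these values, stats["maximum"] is U+FF5E while stats["maximumUtf8"] is U+1F600, so a reader that mixes the two orderings could prune incorrectly. Tagging the ordering in read_bounds is one way to keep the comparison consistent with whichever pair of fields was found.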
