orc-dev mailing list archives

From "Owen O'Malley" <omal...@apache.org>
Subject Re: "For dictionary encodings the dictionary is sorted"
Date Tue, 13 Dec 2016 00:36:00 GMT
> Is it a requirement that the dictionary be sorted or a suggestion?

It is a requirement, although we can discuss weakening it.

The SargApplier doesn't currently use the sorted nature of the
dictionaries, but it should. In particular, it should map sarg predicates
for strings into the dictionary entries using binary search.
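As a rough illustration of the lookup described above, here is a minimal sketch of a binary search over a sorted dictionary. It assumes the entries are sorted by unsigned byte-wise comparison of their UTF-8 encoding; the thread does not pin down the collation, so that is an assumption. The class and method names (SortedDictionaryLookup, find, utf8) are hypothetical and are not part of the ORC code base.

```java
import java.nio.charset.StandardCharsets;

final class SortedDictionaryLookup {
    // Lexicographic comparison of two byte arrays, treating each byte as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Returns the index of key in the sorted dictionary, or
    // -(insertionPoint) - 1 when the key is absent (same convention
    // as java.util.Arrays.binarySearch).
    static int find(byte[][] sortedDictionary, byte[] key) {
        int lo = 0;
        int hi = sortedDictionary.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compareUnsigned(sortedDictionary[mid], key);
            if (cmp < 0) {
                lo = mid + 1;
            } else if (cmp > 0) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -(lo + 1);
    }

    // Helper for the example: encode a Java String as UTF-8 bytes.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[][] dictionary = { utf8("apple"), utf8("banana"), utf8("cherry") };
        System.out.println(find(dictionary, utf8("banana")));
        System.out.println(find(dictionary, utf8("blueberry")));
    }
}
```

An equality predicate maps to at most one dictionary index this way, and a range predicate can be turned into a range of indexes by using the insertion point for each bound.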

The problem with sorting the dictionary is of course that it makes the
writer keep all of the values deserialized until the end of the stripe.
I've considered using a secondary stream that stores the sort order of each
dictionary item. Thoughts?
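One possible reading of the secondary-stream idea, sketched under assumptions: the dictionary entries stay in first-seen order, and a separate array records the rank each entry would have in sorted order. Nothing below reflects an actual ORC stream or API; SortOrderSketch and sortOrder are hypothetical names, and unsigned UTF-8 byte order is assumed as the collation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SortOrderSketch {
    // Lexicographic comparison of two byte arrays, treating each byte as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // For a dictionary kept in insertion order, returns rank[i] = position
    // that entry i would occupy if the dictionary were sorted.
    static int[] sortOrder(byte[][] dictionary) {
        Integer[] ids = new Integer[dictionary.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Sort the entry ids by the bytes they point at; the entries themselves do not move.
        Arrays.sort(ids, (x, y) -> compareUnsigned(dictionary[x], dictionary[y]));
        int[] rank = new int[ids.length];
        for (int pos = 0; pos < ids.length; pos++) {
            rank[ids[pos]] = pos;
        }
        return rank;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Entries in the order the writer first saw them.
        byte[][] dictionary = { utf8("pear"), utf8("apple"), utf8("fig") };
        System.out.println(Arrays.toString(sortOrder(dictionary)));
    }
}
```

A reader could invert the rank array to recover the sorted view for binary search, while the dictionary bytes themselves keep the order in which the writer encountered them.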

> I believe the current implementation is using Java String

No, the dictionary has always used UTF-8.
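To show why the collation question matters, the sketch below compares one pair of strings under Java's String.compareTo (UTF-16 code unit order) and under unsigned byte order of the UTF-8 encoding. The two orders disagree when a supplementary character is compared with a BMP character above the surrogate range. The class and method names are hypothetical and only serve the example.

```java
import java.nio.charset.StandardCharsets;

final class OrderingComparison {
    // Sign of the comparison in Java String natural order (UTF-16 code units).
    static int utf16Order(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    // Sign of the comparison in unsigned byte order of the UTF-8 encodings.
    static int utf8Order(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(x[i] & 0xFF, y[i] & 0xFF);
            if (cmp != 0) {
                return Integer.signum(cmp);
            }
        }
        return Integer.signum(Integer.compare(x.length, y.length));
    }

    public static void main(String[] args) {
        String fullwidthTilde = "\uFF5E";      // U+FF5E, UTF-8 bytes EF BD 9E
        String grinningFace = "\uD83D\uDE00";  // U+1F600, UTF-8 bytes F0 9F 98 80
        // UTF-16: 0xFF5E > 0xD83D, so the tilde sorts after the emoji.
        System.out.println(utf16Order(fullwidthTilde, grinningFace));
        // UTF-8: 0xEF < 0xF0, so the tilde sorts before the emoji.
        System.out.println(utf8Order(fullwidthTilde, grinningFace));
    }
}
```

For strings made only of characters below U+D800 the two orders agree, which is why the difference is easy to miss in testing.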

> I think this should also be documented in the statistics section which
> also uses UTF-16 BE, which is at least consistent, but still annoying for
> everything other than Java.

Yes, it should be documented and we should replace it with UTF-8. (Although
changes to the serialized form are always painful.)

.. Owen



On Sun, Dec 11, 2016 at 4:19 PM, Dain Sundstrom <dain@iq80.com> wrote:

> Hi all,
>
> Quick question about the ORC spec.  In the character types encodings
> section (https://orc.apache.org/docs/encodings.html), it says:
>
>   For dictionary encodings the dictionary is sorted and UTF-8 bytes of
> each unique value are placed into DICTIONARY_DATA.
>
> Is it a requirement that the dictionary be sorted or a suggestion?
>
> I don’t see any code that takes advantage of this and I believe that this
> is only an effort to improve compression of the dictionary.  If it is a
> requirement, the collation order should be documented.  I believe the
> current implementation is using Java String natural ordering which is
> UTF-16 big endian, which is a bit confusing since the dictionary is UTF-8
> encoded.
>
> As a side note, I think this should also be documented in the statistics
> section which also uses UTF-16 BE, which is at least consistent, but still
> annoying for everything other than Java.
>
> Thanks,
>
> -dain
