orc-dev mailing list archives

From Dain Sundstrom <d...@iq80.com>
Subject Re: "For dictionary encodings the dictionary is sorted"
Date Tue, 13 Dec 2016 00:48:21 GMT
On Dec 12, 2016, at 4:36 PM, Owen O'Malley <omalley@apache.org> wrote:
>> Is it a requirement that the dictionary be sorted or a suggestion?
> It is a requirement, although we can discuss weakening it.
> The SargApplier doesn't currently use the sorted nature of the
> dictionaries, but it should. In particular, it should map sarg predicates
> for strings into the dictionary entries using binary search.

In that case we should definitely document the sort order for the dictionary items.
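As a rough illustration of the binary-search mapping described above, here is a minimal sketch. It is not ORC code: the class SortedDictionaryLookup and its methods are hypothetical, and it assumes the dictionary is sorted by unsigned byte order over the UTF-8 encoding, which is exactly the kind of detail the documentation would have to pin down.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, not part of ORC: maps string predicates onto a sorted dictionary.
final class SortedDictionaryLookup {
    // Assumed sort order: unsigned byte comparison of the UTF-8 encoding.
    private static final Comparator<byte[]> UTF8_ORDER = Arrays::compareUnsigned;

    // Dictionary entries as UTF-8 bytes; the array index is the dictionary id.
    private final byte[][] entries;

    // Assumes the supplied values are distinct, as dictionary entries would be.
    SortedDictionaryLookup(String... values) {
        entries = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            entries[i] = values[i].getBytes(StandardCharsets.UTF_8);
        }
        Arrays.sort(entries, UTF8_ORDER);
    }

    // Equality predicate: returns the dictionary id, or -1 if no entry matches.
    int idForEquals(String literal) {
        byte[] key = literal.getBytes(StandardCharsets.UTF_8);
        int pos = Arrays.binarySearch(entries, key, UTF8_ORDER);
        return pos >= 0 ? pos : -1;
    }

    // Range predicate: returns the first id whose value is >= literal,
    // or the dictionary size if every entry is smaller.
    int lowerBound(String literal) {
        byte[] key = literal.getBytes(StandardCharsets.UTF_8);
        int pos = Arrays.binarySearch(entries, key, UTF8_ORDER);
        return pos >= 0 ? pos : -(pos + 1);
    }
}
```

With the entries "apple", "fig", "pear", an equality predicate on "kiwi" finds no id, so the stripe could be skipped for that predicate, while a predicate such as `>= "kiwi"` reduces to `id >= 2`.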

> The problem with sorting the dictionary is of course that it makes the
> writer keep all of the values deserialized until the end of the stripe.
> I've considered using a secondary stream that stores the sort order of each
> dictionary item. Thoughts?

Even with a secondary stream, you will still need the uncompressed values in memory to perform
the lookup in the hash table (the equals call).
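To make the trade-off concrete, here is a minimal sketch of the secondary-stream idea from the quoted text: the dictionary stays in insertion order and a separate permutation records the sort order. Everything here (the class UnsortedDictionaryWithOrder, its methods, and the use of String.compareTo as the order) is a hypothetical illustration, not the ORC writer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.IntStream;

// Hypothetical sketch, not ORC code: insertion-ordered dictionary plus a sort-order permutation.
final class UnsortedDictionaryWithOrder {
    private final Map<String, Integer> ids = new HashMap<>(); // value -> dictionary id
    private final List<String> values = new ArrayList<>();    // dictionary id -> value

    // Returns the id for value, assigning the next free id on first sight.
    int add(String value) {
        Integer id = ids.get(value);
        if (id == null) {
            id = values.size();
            ids.put(value, id);
            values.add(value);
        }
        return id;
    }

    // The "secondary stream": element k is the id of the k-th smallest value.
    // String.compareTo is used here only for brevity (UTF-16 code unit order).
    int[] sortOrder() {
        return IntStream.range(0, values.size())
                .boxed()
                .sorted((a, b) -> values.get(a).compareTo(values.get(b)))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    // Binary search through the permutation; returns the id of key, or -1 if absent.
    int find(int[] sortOrder, String key) {
        int lo = 0;
        int hi = sortOrder.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = values.get(sortOrder[mid]).compareTo(key);
            if (cmp == 0) {
                return sortOrder[mid];
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -1;
    }
}
```

In this sketch ids can be emitted as rows arrive, but the hash map still holds every value until the end, since each add needs it for the equals check.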

>> I believe the current implementation is using Java String
> No, the dictionary has always used UTF-8.

I meant that the sorting of the dictionary seems to be UTF-16 BE.  Is that not correct?
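For anyone wondering why the distinction matters, the two orders disagree once supplementary characters are involved. A minimal sketch follows; the class and method names are made up for illustration, and it only assumes that "UTF-16 BE" order means comparing UTF-16 code units, which is what java.lang.String.compareTo does.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: contrasts the two candidate sort orders.
final class Utf16VersusUtf8Order {
    // UTF-16 code unit order, as implemented by String.compareTo.
    static int compareUtf16(String a, String b) {
        return a.compareTo(b);
    }

    // Unsigned byte order over the UTF-8 encoding, which matches code point order.
    static int compareUtf8(String a, String b) {
        return Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8));
    }
}
```

Take U+FF5E (one UTF-16 code unit, 0xFF5E) and U+1F600 (a surrogate pair starting with 0xD83D). compareUtf16 puts U+1F600 first because 0xD83D < 0xFF5E, while compareUtf8 puts U+FF5E first because its encoding starts with 0xEF and the other starts with 0xF0. A non-Java reader that sorts or binary-searches by UTF-8 bytes would therefore see such a dictionary as out of order.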

>> I think this should also be documented in the statistics section which
>> also uses UTF-16 BE, which is at least consistent, but still annoying for
>> everything other than Java.
> Yes, it should be documented and we should replace it with UTF-8. (Although
> changes to the serialized form are always painful.)

I think we can do something similar to the bloom filter code, where we add a StringUtf8Stats
object and have a transition period where we can produce both.
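A rough sketch of what producing both during the transition could look like on the writer side. The class DualStringStats and its fields are hypothetical, and this says nothing about how the values would be serialized; it only shows min/max being tracked under both orders at once.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not ORC code: tracks string min/max under both orderings.
final class DualStringStats {
    String minUtf16; // legacy statistics: UTF-16 code unit order
    String maxUtf16;
    String minUtf8;  // new statistics: UTF-8 byte order
    String maxUtf8;

    // Unsigned byte comparison of the UTF-8 encodings.
    private static int compareUtf8(String a, String b) {
        return Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8));
    }

    // Folds one column value into both sets of statistics.
    void update(String value) {
        if (minUtf16 == null || value.compareTo(minUtf16) < 0) {
            minUtf16 = value;
        }
        if (maxUtf16 == null || value.compareTo(maxUtf16) > 0) {
            maxUtf16 = value;
        }
        if (minUtf8 == null || compareUtf8(value, minUtf8) < 0) {
            minUtf8 = value;
        }
        if (maxUtf8 == null || compareUtf8(value, maxUtf8) > 0) {
            maxUtf8 = value;
        }
    }
}
```

For most data the two pairs will be identical; they only diverge when a column mixes supplementary characters with characters in the U+E000 to U+FFFF range.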
