kudu-user mailing list archives

From Grant Henke <ghe...@cloudera.com>
Subject Re: INT128 Column Support Interest
Date Mon, 20 Nov 2017 21:12:37 GMT
Thank you for the feedback. Below are some responses.

> Do we have a compatible SQL type to map this to in Spark SQL, Impala,
> Presto, etc? What type would we map to in Java?


In Java we would map to a BigInteger. There isn't a perfectly natural
mapping for SQL that I know of. It has been mentioned in the past that we
could have server-side flags to enable/disable the ability to create
columns of certain types, to prevent users from creating tables that are
not readable by certain integrations. This problem exists today with the
BINARY column type.
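
For illustration, here is a minimal sketch of that mapping (the helper
below is hypothetical, not actual Kudu client API): a 128-bit value
carried as two 64-bit halves can be reassembled into a BigInteger.

    import java.math.BigInteger;

    public class Int128Example {
        // Reassemble a signed 128-bit two's-complement value from its
        // high and low 64-bit words.
        static BigInteger toBigInteger(long high, long low) {
            return BigInteger.valueOf(high)
                    .shiftLeft(64)
                    .add(new BigInteger(Long.toUnsignedString(low)));
        }

        public static void main(String[] args) {
            // Prints 2^127 - 1, the maximum signed 128-bit value.
            System.out.println(toBigInteger(Long.MAX_VALUE, -1L));
        }
    }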

> Why not just _not_ expose it and only expose decimal?


Technically decimal only supports 38 9's, whereas INT128 can support
slightly larger numbers. There may also be more overhead in dealing with a
decimal type, though I am not positive about that.
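
To make that range comparison concrete (a quick BigInteger check, not
Kudu code):

    import java.math.BigInteger;

    public class RangeComparison {
        public static void main(String[] args) {
            // Largest DECIMAL(38) value: 38 9's.
            BigInteger decimalMax =
                BigInteger.TEN.pow(38).subtract(BigInteger.ONE);
            // Largest signed 128-bit integer: 2^127 - 1.
            BigInteger int128Max =
                BigInteger.ONE.shiftLeft(127).subtract(BigInteger.ONE);
            System.out.println(decimalMax); //  99999999999999999999999999999999999999
            System.out.println(int128Max);  // 170141183460469231731687303715884105727
        }
    }

So the INT128 maximum is only about 1.7x the largest DECIMAL(38) value.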

> Encoders: like Dan mentioned, it seems like we might not be able to do a
> very efficient job of encoding these very large integers. Stuff like
> bitshuffle, SIMD bitpacking, etc, isn't really designed for such large
> values. So, I'm a little afraid that we'll end up only with PLAIN and
> people will be upset with the storage overhead and performance.


> Aren't we going to need efficient encodings in order to make decimal work
> well, anyway?


We will need to ensure a performant encoding exists for INT128 to make
decimals with precision greater than 18 work well anyway. We should likely
have parity with the other integer types to reduce any confusion about
differing precisions having different encoding considerations. Presto, for
example, documents that precisions greater than 18 are slower than the
others; we could do something similar and follow up with improvements.
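
As a point of reference, the precision-to-storage relationship behind that
threshold looks like this (a sketch; the method name is just illustrative):

    public class DecimalStorage {
        // Precisions up to 9 fit in 32 bits, up to 18 in 64 bits, and
        // anything larger needs 128 bits.
        static int storageBitsForPrecision(int precision) {
            if (precision <= 9)  return 32;
            if (precision <= 18) return 64;
            return 128;
        }

        public static void main(String[] args) {
            System.out.println(storageBitsForPrecision(18)); // 64
            System.out.println(storageBitsForPrecision(19)); // 128
        }
    }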

In the current int128 internal patch I know that RLE doesn't work for
int128. I don't have a lot of background on Kudu's encoding details, so
investigating encodings further is one of my next steps.
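
Purely as an illustrative sketch (not what the current patch does), one
direction would be to split each 128-bit value into two 64-bit words so
that the existing 64-bit encoders could be applied to each half:

    import java.math.BigInteger;

    public class SplitInt128 {
        // Split a 128-bit two's-complement value into its high and low
        // 64-bit words; each column of words could then be fed to an
        // existing 64-bit encoder independently.
        static long[] split(BigInteger v) {
            long low = v.longValue();                  // low 64 bits
            long high = v.shiftRight(64).longValue();  // high 64 bits
            return new long[] { high, low };
        }

        public static void main(String[] args) {
            long[] w = split(BigInteger.ONE.shiftLeft(127).subtract(BigInteger.ONE));
            System.out.println(Long.toHexString(w[0])); // 7fffffffffffffff
            System.out.println(Long.toHexString(w[1])); // ffffffffffffffff
        }
    }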

Thank you,
Grant

On Thu, Nov 16, 2017 at 5:30 PM, Dan Burkert <danburkert@apache.org> wrote:

> Aren't we going to need efficient encodings in order to make decimal work
> well, anyway?
>
> - Dan
>
> On Thu, Nov 16, 2017 at 2:54 PM, Todd Lipcon <todd@cloudera.com> wrote:
>
>> On Thu, Nov 16, 2017 at 2:28 PM, Dan Burkert <danburkert@apache.org>
>> wrote:
>>
>> > I think it would be useful.  As far as I've seen the main costs in
>> > carrying data types are in writing performant encoders, and updating
>> > integrations to work with them.  I'm guessing with 128 bit integers there
>> > would be some integrations that can't or won't support it, which might be a
>> > cause for confusion.  Overall, though, I think the upsides of efficiency
>> > and decreased storage space are compelling.  Do you have a sense yet of
>> > what encodings are going to be supported down the road (will we get to full
>> > parity with 32/64)?
>> >
>>
>> Yea, my concerns are:
>>
>> 1) Integrations: do we have a compatible SQL type to map this to in Spark
>> SQL, Impala, Presto, etc? What type would we map to in Java? It seems like
>> the most natural mapping would be DECIMAL(39) or somesuch in SQL. So, if
>> we're going to map it the same as decimal anyway, why not just _not_ expose
>> it and only expose decimal? If someone wants to store a 128-bit hash as a
>> DECIMAL(39) they are free to, of course. Postgres's built-in int types only
>> go up to 64-bit (bigint).
>>
>> In addition to the choice of DECIMAL, for things like fixed-length binary
>> maybe we are better off later adding a fixed-length BINARY type, like
>> BINARY(16) which could be used for storing large hashes? There is precedent
>> for fixed-length CHAR(n) in SQL, but no such precedent for int128.
>>
>>
>> 2) Encoders: like Dan mentioned, it seems like we might not be able to do a
>> very efficient job of encoding these very large integers. Stuff like
>> bitshuffle, SIMD bitpacking, etc, isn't really designed for such large
>> values. So, I'm a little afraid that we'll end up only with PLAIN and
>> people will be upset with the storage overhead and performance.
>>
>> -Todd
>>
>> >
>> > On Thu, Nov 16, 2017 at 2:19 PM, Grant Henke <ghenke@cloudera.com> wrote:
>> >
>> >> Hi all,
>> >>
>> >> As a part of adding DECIMAL support to Kudu it was necessary to add
>> >> internal support for 128 bit integers. Taking that one step further and
>> >> supporting public columns and APIs for 128 bit integers would not be too
>> >> much additional work. However, I wanted to gauge the interest from the
>> >> community.
>> >>
>> >> My initial thoughts are that having an INT128 column type could be useful
>> >> for things like UUIDs, IPv6 addresses, MD5 hashes and other similar types
>> >> of data.
>> >>
>> >> Is there any interest or uses for an INT128 column type? Is anyone
>> >> currently using a STRING or BINARY column for 128 bit data?
>> >>
>> >> Thank you,
>> >> Grant
>> >> --
>> >> Grant Henke
>> >> Software Engineer | Cloudera
>> >> grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
>> >>
>> >
>> >
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
