kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: INT128 Column Support Interest
Date Tue, 21 Nov 2017 00:51:11 GMT
On Mon, Nov 20, 2017 at 1:12 PM, Grant Henke <ghenke@cloudera.com> wrote:

> Thank you for the feedback. Below are some responses.
>
> > Do we have a compatible SQL type to map this to in Spark SQL, Impala,
> > Presto, etc? What type would we map to in Java?
>
>
> In Java we would map to a BigInteger. There isn't a perfectly natural
> mapping for SQL that I know of. It has been mentioned in the past that we
> could have server side flags to disable/enable the ability to create
> columns of certain types to prevent users from creating tables that are not
> readable by certain integrations. This problem exists today with the BINARY
> column type.
>

I'm somewhat against such a configuration. This being a server-side
configuration results in Kudu deployments in different environments having
different sets of available types, which seems very difficult for
downstream users to deal with. Even though "least common denominator" kind
of sucks, it's also not a bad policy for software that aims to be part of a
pretty diverse ecosystem.



>
> > Why not just _not_ expose it and only expose decimal?
>
>
> Technically decimal only supports 28 9's where INT128 can support slightly
> larger numbers. There may also be more overhead dealing with a decimal
> type, though I am not positive about that.
>

I think without clear user demand for >28 digits it's just not worth the
complexity.
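
As a rough check on the ranges being compared here, the sketch below works
out how many decimal digits always fit in a signed 64-bit and a signed
128-bit integer. It is plain Python arithmetic, not Kudu code, and the
helper name max_full_digits is made up for illustration.

```python
# Plain arithmetic sketch; nothing here is taken from Kudu itself.
INT64_MAX = 2**63 - 1
INT128_MAX = 2**127 - 1


def max_full_digits(limit: int) -> int:
    """Return the largest n such that n nines (10**n - 1) is <= limit.

    Hypothetical helper: every n-digit value is then guaranteed to fit.
    """
    n = 0
    while 10 ** (n + 1) - 1 <= limit:
        n += 1
    return n


int64_digits = max_full_digits(INT64_MAX)
int128_digits = max_full_digits(INT128_MAX)

print("signed 64-bit holds any value of up to", int64_digits, "digits")
print("signed 128-bit holds any value of up to", int128_digits, "digits")
```

The digit counts printed are properties of the integer widths only; how a
given DECIMAL implementation maps precision onto storage is a separate
question.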


>
> > Encoders: like Dan mentioned, it seems like we might not be able to do a
> > very efficient job of encoding these very large integers. Stuff like
> > bitshuffle, SIMD bitpacking, etc, isn't really designed for such large
> > values. So, I'm a little afraid that we'll end up only with PLAIN and
> > people will be upset with the storage overhead and performance.
>
>
> > Aren't we going to need efficient encodings in order to make decimal work
> > well, anyway?
>
>
> We will need to ensure performant encoding exists for INT128 to make
> decimals with a precision >= 18 work well anyway. We should likely have
> parity with the other integer types to reduce any confusion about differing
> precisions having different encoding considerations. That said, Presto
> documents that decimals with precision >= 18 are slower than the others, so
> we could do something similar and follow on with improvements.
>
> In the current int128 internal patch I know that the RLE doesn't work for
> int128. I don't have a lot of background on Kudu's encoding details, so
> investigating encodings further is one of my next steps.
>

That's a good point. However, I'm guessing that users will intuitively know
that "9 digits is enough" more easily than they will know that "64 bits is
enough". In my experience people underestimate the range of 64-bit integers
and might choose INT128 if available even if they have no need for anywhere
near that range.
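
To put the 64-bit range in perspective, here is a small back-of-the-envelope
sketch. The rate of one million increments per second is an arbitrary
assumption picked for illustration, and years_until_overflow is a
hypothetical helper, not part of any Kudu API.

```python
# Back-of-the-envelope sketch; the event rate below is an assumed figure.
INT64_MAX = 2**63 - 1
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_until_overflow(events_per_second: int) -> float:
    """Years a signed 64-bit counter lasts when incremented at a fixed rate."""
    return INT64_MAX / events_per_second / SECONDS_PER_YEAR


# Assumption: a counter bumped one million times per second, starting at 0.
years = years_until_overflow(1_000_000)
print(f"about {years:,.0f} years before a signed 64-bit counter overflows")
```

At that assumed rate the counter lasts on the order of hundreds of thousands
of years, which is the kind of headroom people tend to underestimate.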

-Todd


>
> On Thu, Nov 16, 2017 at 5:30 PM, Dan Burkert <danburkert@apache.org> wrote:
>
> > Aren't we going to need efficient encodings in order to make decimal work
> > well, anyway?
> >
> > - Dan
> >
> > On Thu, Nov 16, 2017 at 2:54 PM, Todd Lipcon <todd@cloudera.com> wrote:
> >
> >> On Thu, Nov 16, 2017 at 2:28 PM, Dan Burkert <danburkert@apache.org>
> >> wrote:
> >>
> >> > I think it would be useful. As far as I've seen the main costs in
> >> > carrying data types are in writing performant encoders, and updating
> >> > integrations to work with them. I'm guessing with 128 bit integers there
> >> > would be some integrations that can't or won't support it, which might
> >> > be a cause for confusion. Overall, though, I think the upsides of
> >> > efficiency and decreased storage space are compelling. Do you have a
> >> > sense yet of what encodings are going to be supported down the road
> >> > (will we get to full parity with 32/64)?
> >> >
> >>
> >> Yea, my concerns are:
> >>
> >> 1) Integrations: do we have a compatible SQL type to map this to in Spark
> >> SQL, Impala, Presto, etc? What type would we map to in Java? It seems like
> >> the most natural mapping would be DECIMAL(39) or somesuch in SQL. So, if
> >> we're going to map it the same as decimal anyway, why not just _not_
> >> expose it and only expose decimal? If someone wants to store a 128-bit
> >> hash as a DECIMAL(39) they are free to, of course. Postgres's built-in int
> >> types only go up to 64-bit (bigint).
> >>
> >> In addition to the choice of DECIMAL, for things like fixed-length binary
> >> maybe we are better off later adding a fixed-length BINARY type, like
> >> BINARY(16) which could be used for storing large hashes? There is
> >> precedent for fixed-length CHAR(n) in SQL, but no such precedent for
> >> int128.
> >>
> >>
> >> 2) Encoders: like Dan mentioned, it seems like we might not be able to do
> >> a very efficient job of encoding these very large integers. Stuff like
> >> bitshuffle, SIMD bitpacking, etc, isn't really designed for such large
> >> values. So, I'm a little afraid that we'll end up only with PLAIN and
> >> people will be upset with the storage overhead and performance.
> >>
> >> -Todd
> >>
> >> >
> >> > On Thu, Nov 16, 2017 at 2:19 PM, Grant Henke <ghenke@cloudera.com>
> >> wrote:
> >> >
> >> >> Hi all,
> >> >>
> >> >> As a part of adding DECIMAL support to Kudu it was necessary to add
> >> >> internal support for 128 bit integers. Taking that one step further
> >> >> and supporting public columns and APIs for 128 bit integers would not
> >> >> be too much additional work. However, I wanted to gauge the interest
> >> >> from the community.
> >> >>
> >> >> My initial thoughts are that having an INT128 column type could be
> >> >> useful for things like UUIDs, IPv6 addresses, MD5 hashes and other
> >> >> similar types of data.
> >> >>
> >> >> Is there any interest in or use for an INT128 column type? Is anyone
> >> >> currently using a STRING or BINARY column for 128 bit data?
> >> >>
> >> >> Thank you,
> >> >> Grant
> >> >> --
> >> >> Grant Henke
> >> >> Software Engineer | Cloudera
> >> >> grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
> >> >>
> >> >
> >> >
> >>
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
> >>
> >
> >
>
>
> --
> Grant Henke
> Software Engineer | Cloudera
> grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
>



-- 
Todd Lipcon
Software Engineer, Cloudera
