kafka-dev mailing list archives

From "Ewen Cheslack-Postava (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-4353) Add semantic types to Kafka Connect
Date Mon, 07 Nov 2016 23:45:58 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15645840#comment-15645840 ]

Ewen Cheslack-Postava commented on KAFKA-4353:

[~rhauch] Some of these make sense to me, others don't as much. UUID is an example that I
think most programming languages have as a built-in now, so probably makes more sense as a
native type (although interestingly, I would have represented it as bytes, not in string form).
JSON might be a good example of the opposite, where if you're really intent on not passing
it through Connect (and it'd be painful for every Converter to have to also support JSON),
then I agree just naming the type should be enough.
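A quick illustration of the string-vs-bytes point about UUIDs: the canonical string form is 36 characters, while the raw form is 16 bytes. A stdlib-only sketch (plain Java, not the Connect API) of packing a {{java.util.UUID}} into bytes:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidBytes {
    // A UUID is two 64-bit halves, so the raw form is exactly 16 bytes
    // (vs. 36 characters in the canonical hyphenated string form).
    static byte[] toBytes(UUID u) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(u.getMostSignificantBits());
        buf.putLong(u.getLeastSignificantBits());
        return buf.array();
    }

    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new UUID(buf.getLong(), buf.getLong());
    }

    public static void main(String[] args) {
        UUID u = UUID.randomUUID();
        System.out.println(u.toString().length() + " chars as a string, "
                + toBytes(u).length + " bytes raw");
    }
}
```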

There's a bit more to my concern around a large # of logical types than just Converters having
to support them. The good thing w/ Converters is that there are bound to be relatively few
of them, so while adding more types is annoying, it's not the end of the world. But if there
are 40 specialized types, do we actually think connectors are commonly going to be able to
do something useful with them? I just worry about having 15 different types for time since
most systems in practice only have a couple (the fact that you're looking at CDC is probably
why you're seeing a lot more, but there it doesn't look to me like there's actually a lot
of overlap).

I think this is just a matter of impedance mismatch between different systems and how far
we think it makes sense to bend over backwards to preserve as much info as possible vs where
reasonable compromises can be made that make the story for Converter/Connector developers
sane (and, frankly, users since once the data exits connect, they presumably need to understand
all the types that can be emitted as well).

I think the idea of semantic types makes sense -- we wanted to be able to name types for exactly
this reason (beyond even these close-to-primitive types). You can of course do this already
with your own names, I think you're just trying to get coordination between source and sink
connectors (and maybe other applications if they maintain & know to look at the schema
name) since you'd prefer not to do this with debezium-specific names? Will all of the ones
you listed actually make sense for applications? Take MicroTime vs NanoTime as an example
-- they end up eating up the same storage anyway, so would it make sense to just do it all as
NanoTime (whereas MilliTimestamp and MicroTimestamp cover different possible ranges of time)?
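To put numbers on the storage point: both microseconds and nanoseconds past midnight fit the same INT64 with enormous headroom, so the distinction between MicroTime and NanoTime is only the unit, never the range.

```java
public class TimeRanges {
    // Nanoseconds in one day: 86,400 s * 1e9 = 86,400,000,000,000
    static long nanosPerDay() {
        return 24L * 60 * 60 * 1_000_000_000L;
    }

    // Microseconds in one day: 86,400 s * 1e6 = 86,400,000,000
    static long microsPerDay() {
        return 24L * 60 * 60 * 1_000_000L;
    }

    public static void main(String[] args) {
        // Both values are far below Long.MAX_VALUE (~9.2e18), so MicroTime
        // and NanoTime occupy identical INT64 storage; only the unit differs.
        System.out.println(nanosPerDay());   // 86400000000000
        System.out.println(microsPerDay());  // 86400000000
    }
}
```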

It might also make sense to try to get some feedback from the community as to which of these
they'd use (and which might be missing, including logical types). It's a lot more compelling
to hear that a dozen connectors are providing UUID as just a string because they don't have
a named type.
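For what it's worth, the coordination being proposed can be sketched with a toy model (the {{Schema}} record below is an illustrative stand-in, not the actual Connect API): a source tags a plain STRING schema with a well-known name, and a sink treats that name as an optional hint, falling back to the primitive when it doesn't recognize it.

```java
import java.util.UUID;

public class SemanticHint {
    // Toy stand-in for a Connect schema: a primitive type plus an optional
    // semantic name. This models the idea only; it is not the real API.
    record Schema(String type, String name) {}

    // Sink side: honor the name if recognized, otherwise fall back to the
    // primitive value -- which is what keeps new names backward compatible.
    static Object interpret(Schema schema, String value) {
        if ("o.k.c.d.Uuid".equals(schema.name())) {
            return UUID.fromString(value); // recognized hint
        }
        return value; // unknown or absent name: just a STRING
    }

    public static void main(String[] args) {
        Schema tagged = new Schema("STRING", "o.k.c.d.Uuid");
        Schema plain = new Schema("STRING", null);
        System.out.println(interpret(tagged,
                "123e4567-e89b-12d3-a456-426614174000").getClass().getSimpleName());
        System.out.println(interpret(plain, "hello"));
    }
}
```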

> Add semantic types to Kafka Connect
> -----------------------------------
>                 Key: KAFKA-4353
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4353
>             Project: Kafka
>          Issue Type: Improvement
>          Components: KafkaConnect
>    Affects Versions:
>            Reporter: Randall Hauch
>            Assignee: Ewen Cheslack-Postava
> Kafka Connect's schema system defines several _core types_ that consist of:
> * STRUCT
> * ARRAY
> * MAP
> plus these _primitive types_:
> * INT8
> * INT16
> * INT32
> * INT64
> * FLOAT32
> * FLOAT64
> * BOOLEAN
> * STRING
> * BYTES
> The {{Schema}} for these core types defines several attributes, but they do not have a schema name.
> Kafka Connect also defines several _logical types_ that are specializations of the primitive
types and _do_ have schema names _and_ are automatically mapped to/from Java objects:
> || Schema Name || Primitive Type || Java value class || Description ||
> | o.k.c.d.Decimal | {{BYTES}} | {{java.math.BigDecimal}} | An arbitrary-precision signed decimal number. |
> | o.k.c.d.Date | {{INT32}} | {{java.util.Date}} | A date representing a calendar day with no time of day or timezone. The {{java.util.Date}} value's hours, minutes, seconds, and milliseconds are set to 0. The underlying representation is an integer representing the number of standardized days (based on a number of milliseconds with 24 hours/day, 60 minutes/hour, 60 seconds/minute, 1000 milliseconds/second, and no leap seconds) since the Unix epoch. |
> | o.k.c.d.Time | {{INT32}} | {{java.util.Date}} | A time representing a specific point in a day, not tied to any specific date. Only the {{java.util.Date}} value's hours, minutes, seconds, and milliseconds can be non-zero. This effectively makes it a point in time during the first day after the Unix epoch. The underlying representation is an integer representing the number of milliseconds after midnight. |
> | o.k.c.d.Timestamp | {{INT64}} | {{java.util.Date}} | A timestamp representing an absolute time, without timezone information. The underlying representation is a long representing the number of milliseconds since the Unix epoch. |
> where "o.k.c.d" is short for {{org.apache.kafka.connect.data}}. [~ewencp] has stated in the
past that adding more logical types is challenging and generally undesirable, since everyone
using Kafka Connect values has to deal with all new logical types.
> This proposal adds standard _semantic_ types that sit somewhere between the core types
and the logical types. Basically, they are just predefined schemas that have names and are based
on the primitive types. However, unlike logical types, there is no mapping to any Java form other than the primitive.
> The purpose of semantic types is to provide hints as to how the values _can_ be treated.
Of course, clients are free to ignore the hints of some or all of the built-in semantic types,
and in these cases would treat the values as the primitive value with no extra semantics.
This behavior makes it much easier to add new semantic types over time without risking incompatibilities.
> Really, any source connector can define custom semantic types, but there is tremendous
value in having a library of standard, well-known semantic types, including:
> || Schema Name || Primitive Type || Description ||
> | o.k.c.d.Uuid | {{STRING}} | A UUID in string form.|
> | o.k.c.d.Json | {{STRING}} | A JSON document, array, or scalar in string form.|
> | o.k.c.d.Xml | {{STRING}} | An XML document in string form.|
> | o.k.c.d.BitSet | {{STRING}} | A string of zero or more {{0}} or {{1}} characters.|
> | o.k.c.d.ZonedTime | {{STRING}} | An ISO-8601 formatted representation of a time (with fractional seconds) with timezone or offset from UTC.|
> | o.k.c.d.ZonedTimestamp | {{STRING}} | An ISO-8601 formatted representation of a timestamp with timezone or offset from UTC.|
> | o.k.c.d.EpochDays | {{INT64}} | A date with no time or timezone information, represented as the number of days since (or before) the epoch, January 1, 1970, at 00:00:00 UTC.|
> | o.k.c.d.Year | {{INT32}} | The year number.|
> | o.k.c.d.MilliTime | {{INT32}} | Number of milliseconds past midnight.|
> | o.k.c.d.MicroTime | {{INT64}} | Number of microseconds past midnight.|
> | o.k.c.d.NanoTime | {{INT64}} | Number of nanoseconds past midnight.|
> | o.k.c.d.MilliTimestamp | {{INT64}} | Number of milliseconds past epoch.|
> | o.k.c.d.MicroTimestamp | {{INT64}} | Number of microseconds past epoch.|
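For contrast with the semantic types quoted above, the existing logical-type mapping (e.g. {{o.k.c.d.Decimal}} carrying a {{java.math.BigDecimal}} as {{BYTES}}) involves an actual value conversion. A stdlib-only sketch of that encoding, assuming the scale travels alongside the payload the way it does in the Decimal schema:

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

public class DecimalCodec {
    // The BYTES payload is the two's-complement unscaled value; the scale is
    // carried separately (in Connect it lives in the schema parameters).
    static byte[] encode(BigDecimal value, int scale) {
        // UNNECESSARY: refuse to silently round values that don't fit the scale.
        return value.setScale(scale, RoundingMode.UNNECESSARY)
                .unscaledValue().toByteArray();
    }

    static BigDecimal decode(byte[] bytes, int scale) {
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        BigDecimal price = new BigDecimal("12.34");
        byte[] wire = encode(price, 2);
        System.out.println(decode(wire, 2)); // prints 12.34
    }
}
```

This is exactly why every Converter must know each logical type: the bytes are meaningless without the named schema and its scale parameter.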

This message was sent by Atlassian JIRA
