kafka-dev mailing list archives

From "Randall Hauch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-4353) Add semantic types to Kafka Connect
Date Tue, 08 Nov 2016 16:16:58 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15647999#comment-15647999 ]

Randall Hauch commented on KAFKA-4353:

Logical types and semantic types are not the same thing, and they don't carry the same weight.
The point of semantic types is not so much that every programming language has constructs
for them, but rather that a *source* accessed by a connector has this concept and wants to
capture it. Whether consumers choose to do anything with this extra semantic information
is beside the point, because as soon as it's available consumers *can* do something with
it. In this way, semantic types are very different from logical types, which build the
conversion logic to and from programming language constructs into the converters.

Sure, source connectors can define their own semantic types by simply creating a schema based
upon a primitive and giving it a name. Debezium is doing precisely this for JSON, XML, UUIDs,
and temporal types so that its source connectors can include as much information as possible
about the data captured in the event messages. The problem with this is that sink connectors
written by other communities or organizations are not likely to know about Debezium's semantic
types. The bottom line is that having some standard semantic types will mean that more connectors
are developed to support them, and that people can much more easily mix and match source and
sink connectors.

JSON is an excellent example. Source connectors can capture that {{STRING}} fields are in
fact JSON documents, arrays, or scalars, and sink connectors pushing data into systems that
*do* have some notion of JSON could take the {{STRING}} values and parse them into JSON representation
before using them. I concede that it's maybe not useful to have lots of similar temporal
semantic types with different units, but at a minimum I do think it is useful to have semantic
types for years, days, and ISO 8601 timestamps.

Really, semantic types are just a convention of using the existing schema system but with
well-known schema names. Perhaps it's less useful for Kafka Connect software to define the
few constants and trivial utility methods, and more useful to treat it as a protocol that
multiple organizations can collaborate on and support.
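
To make the convention concrete, here is a minimal Java sketch of how a sink might treat a
schema name as a hint and fall back to the primitive value when it does not recognize the name.
It uses only the JDK rather than the Kafka Connect API. The class SemanticTypeSketch, the
interpret method, and both name constants are hypothetical: the names follow the proposal's
table and do not exist in Kafka Connect. The bit ordering (leftmost character is bit 0) is
also an assumption, since the proposal does not specify one.

```java
import java.util.BitSet;
import java.util.UUID;

// Hypothetical sketch; none of these names are part of the Kafka Connect API.
final class SemanticTypeSketch {
    // Proposed schema names (assumption: they never shipped in Kafka Connect).
    static final String PREFIX = "org.apache.kafka.connect.data.";
    static final String UUID_NAME = PREFIX + "Uuid";
    static final String BIT_SET_NAME = PREFIX + "BitSet";

    // Uses the schema name as a hint. Unknown or missing names leave the
    // primitive value untouched, so new semantic types stay compatible.
    static Object interpret(String schemaName, Object value) {
        if (schemaName == null || value == null) {
            return value;
        }
        switch (schemaName) {
            case UUID_NAME:
                return UUID.fromString((String) value);
            case BIT_SET_NAME:
                return parseBits((String) value);
            default:
                return value;
        }
    }

    // Parses a string of '0' and '1' characters; character i becomes bit i.
    static BitSet parseBits(String bits) {
        BitSet set = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            char c = bits.charAt(i);
            if (c == '1') {
                set.set(i);
            } else if (c != '0') {
                throw new IllegalArgumentException("Not a bit string: " + bits);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(interpret(UUID_NAME, "123e4567-e89b-12d3-a456-426614174000"));
        System.out.println(interpret(BIT_SET_NAME, "0110"));
        System.out.println(interpret("com.example.Unknown", "left as a plain string"));
    }
}
```

A consumer that ignores the hints simply never calls interpret and keeps working with the
{{STRING}} values.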

> Add semantic types to Kafka Connect
> -----------------------------------
>                 Key: KAFKA-4353
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4353
>             Project: Kafka
>          Issue Type: Improvement
>          Components: KafkaConnect
>    Affects Versions:
>            Reporter: Randall Hauch
>            Assignee: Ewen Cheslack-Postava
> Kafka Connect's schema system defines several _core types_ that consist of:
> * MAP
> plus these _primitive types_:
> * INT8
> * INT16
> * INT32
> * INT64
> * FLOAT32
> * FLOAT64
> The {{Schema}} for these core types define several attributes, but they do not have a
> Kafka Connect also defines several _logical types_ that are specializations of the primitive
types and _do_ have schema names _and_ are automatically mapped to/from Java objects:
> || Schema Name || Primitive Type || Java value class || Description ||
> | o.k.c.d.Decimal | {{BYTES}} | {{java.math.BigDecimal}} | An arbitrary-precision signed
decimal number. |
> | o.k.c.d.Date | {{INT32}} | {{java.util.Date}} | A date representing a calendar day
with no time of day or timezone. The {{java.util.Date}} value's hours, minutes, seconds, milliseconds
are set to 0. The underlying representation is an integer representing the number of standardized
days (based on a number of milliseconds with 24 hours/day, 60 minutes/hour, 60 seconds/minute,
1000 milliseconds/second) since Unix epoch. |
> | o.k.c.d.Time | {{INT32}} | {{java.util.Date}} | A time representing a specific point
in a day, not tied to any specific date. Only the {{java.util.Date}} value's hours, minutes,
seconds, and milliseconds can be non-zero. This effectively makes it a point in time during
the first day after the Unix epoch. The underlying representation is an integer representing
the number of milliseconds after midnight. |
> | o.k.c.d.Timestamp | {{INT64}} | {{java.util.Date}} | A timestamp representing an absolute
time, without timezone information. The underlying representation is a long representing the
number of milliseconds since Unix epoch. |
> where "o.k.c.d" is short for {{org.apache.kafka.connect.data}}. [~ewencp] has stated in the
past that adding more logical types is challenging and generally undesirable, since everyone
who uses Kafka Connect values has to deal with all new logical types.
> This proposal adds standard _semantic_ types that are somewhere between the core types
and logical types. Basically, they are just predefined schemas that have names and are based
on other primitive types. However, there is no mapping to another form other than the primitive.
> The purpose of semantic types is to provide hints as to how the values _can_ be treated.
Of course, clients are free to ignore the hints of some or all of the built-in semantic types,
and in these cases would treat the values as the primitive value with no extra semantics.
This behavior makes it much easier to add new semantic types over time without risking incompatibilities.
> Really, any source connector can define custom semantic types, but there is tremendous
value in having a library of standard, well-known semantic types, including:
> || Schema Name || Primitive Type || Description ||
> | o.k.c.d.Uuid | {{STRING}} | A UUID in string form.|
> | o.k.c.d.Json | {{STRING}} | A JSON document, array, or scalar in string form.|
> | o.k.c.d.Xml | {{STRING}} | An XML document in string form.|
> | o.k.c.d.BitSet | {{STRING}} | A string of zero or more {{0}} or {{1}} characters.|
> | o.k.c.d.ZonedTime | {{STRING}} | An ISO-8601 formatted representation of a time (with
fractional seconds) with timezone or offset from UTC.|
> | o.k.c.d.ZonedTimestamp | {{STRING}} | An ISO-8601 formatted representation of a timestamp
with timezone or offset from UTC.|
> | o.k.c.d.EpochDays | {{INT64}} | A date with no time or timezone information, represented
as the number of days since (or before) the epoch (January 1, 1970, at 00:00:00 UTC).|
> | o.k.c.d.Year | {{INT32}} | The year number.|
> | o.k.c.d.MilliTime | {{INT32}} | Number of milliseconds past midnight.|
> | o.k.c.d.MicroTime | {{INT64}} | Number of microseconds past midnight.|
> | o.k.c.d.NanoTime | {{INT64}} | Number of nanoseconds past midnight.|
> | o.k.c.d.MilliTimestamp | {{INT64}} | Number of milliseconds past epoch.|
> | o.k.c.d.MicroTimestamp | {{INT64}} | Number of microseconds past epoch.|

This message was sent by Atlassian JIRA
