kafka-dev mailing list archives

From "Ewen Cheslack-Postava (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-3744) Message format needs to identify serializer
Date Fri, 27 May 2016 07:00:20 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303666#comment-15303666 ]

Ewen Cheslack-Postava commented on KAFKA-3744:

Just to second [~ijuma]'s comments, this absolutely needs a KIP. "Affects the format" doesn't
quite capture the requirements for a KIP. Even things that affect semantics but don't strictly
affect format are subject to KIPs. The end result of the KIP could be that it doesn't affect
older clients that simply ignore those bits, but it's still really important to have that discussion
and make sure that's an acceptable path.

Re: the specific proposal, I'm skeptical. Magic bytes are a *very* common approach for format
detection and don't require any specialized support, are used by a lot of people today, and
seem to work fine in practice. From my reading, the proposal also assumes that key and value
serialization is the same, which it turns out is not the case for many users (and I have found
this in practice a lot based on issues filed against Confluent's REST proxy where people want
simple serialization for keys, e.g. UTF8 strings, and complex serialization for values, e.g.
GenericRecords). Formats like JSON are the main exception here re: magic bytes. My impression
is that folks that actually think about multiple formats realize up front that you need magic
bytes and include them. If you use something like JSON, you tend to track this somehow externally
such that you know based on topics what format you're using. I'm not convinced of the benefit
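
For illustration, a minimal sketch of the magic-byte approach described above, in Python. The
byte values and the names encode/decode are hypothetical; they are not part of Kafka or of any
particular serializer. Each payload carries a one-byte prefix that the consumer uses to pick a
deserializer, and key and value are framed independently, so they can use different formats.

```python
import json

# Hypothetical one-byte format identifiers; Kafka itself defines none of these.
MAGIC_TEXT = 0x00
MAGIC_JSON = 0x01


def encode(magic: int, payload: bytes) -> bytes:
    """Prefix an already-serialized payload with its one-byte format identifier."""
    return bytes([magic]) + payload


def decode(data: bytes):
    """Inspect the first byte and dispatch to the matching deserializer."""
    if not data:
        raise ValueError("empty payload has no magic byte")
    magic, body = data[0], data[1:]
    if magic == MAGIC_TEXT:
        return body.decode("utf-8")
    if magic == MAGIC_JSON:
        return json.loads(body.decode("utf-8"))
    raise ValueError(f"unknown magic byte: {magic:#04x}")


# Simple serialization for the key, richer serialization for the value.
key = encode(MAGIC_TEXT, "user-42".encode("utf-8"))
value = encode(MAGIC_JSON, json.dumps({"clicks": 3}).encode("utf-8"))
```

Because the prefix lives inside the opaque key and value bytes, this needs no support from the
broker or the message format. A payload written without a prefix (plain JSON, for example) cannot
be detected this way, which is why such topics tend to be tracked externally instead.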

> Message format needs to identify serializer
> -------------------------------------------
>                 Key: KAFKA-3744
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3744
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: David Kay
>            Priority: Minor
> https://issues.apache.org/jira/browse/KAFKA-3698 was recently resolved with https://github.com/apache/kafka/commit/27a19b964af35390d78e1b3b50bc03d23327f4d0.
> But Kafka documentation on message formats needs to be more explicit for new users. Section
> 1.3 Step 4 says: "Send some messages" and takes lines of text from the command line. Beginner's
> guide (http://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign Slide 104):
> {noformat}
>    Kafka does not care about data format of msg payload
>    Up to developer to handle serialization/deserialization
>       Common choices: Avro, JSON
> {noformat}
> If one producer sends lines of console text, another producer sends Avro, a third producer
> sends JSON, and a fourth sends CBOR, how does the consumer identify which deserializer to
> use for the payload?  The commit includes an opaque K byte Key that could potentially include
> a codec identifier, but provides no guidance on how to use it:
> {quote}
> "Leaving the key and value opaque is the right decision: there is a great deal of progress
> being made on serialization libraries right now, and any particular choice is unlikely to
> be right for all uses. Needless to say a particular application using Kafka would likely mandate
> a particular serialization type as part of its usage."
> {quote}
> Mandating any particular serialization is as unrealistic as mandating a single mime-type
> for all web content.  There must be a way to signal the serialization used to produce this
> message's V byte payload, and documenting the existence of even a rudimentary codec registry
> with a few values (text, Avro, JSON, CBOR) would establish the pattern to be used for future
> serialization libraries.

This message was sent by Atlassian JIRA
