kafka-jira mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-3744) Message format needs to identify serializer
Date Sun, 25 Feb 2018 07:22:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375972#comment-16375972 ]

ASF GitHub Bot commented on KAFKA-3744:
---------------------------------------

hachikuji closed pull request #1419: KAFKA-3744: Allocate 2 attribute bits to signal payload format
URL: https://github.com/apache/kafka/pull/1419
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/docs/implementation.html b/docs/implementation.html
index 16ba07a456c..aa58d8c740a 100644
--- a/docs/implementation.html
+++ b/docs/implementation.html
@@ -146,12 +146,24 @@ <h3><a id="messages" href="#messages">5.3 Messages</a></h3>
 <p>
 Messages consist of a fixed-size header, a variable length opaque key byte array and a variable length opaque value byte array. The header contains the following fields:
 <ul>
-    <li> A CRC32 checksum to detect corruption or truncation. <li/>
+    <li> A CRC32 checksum to detect corruption or truncation. </li>
     <li> A format version. </li>
     <li> An attributes identifier </li>
     <li> A timestamp </li>
 </ul>
-Leaving the key and value opaque is the right decision: there is a great deal of progress being made on serialization libraries right now, and any particular choice is unlikely to be right for all uses. Needless to say a particular application using Kafka would likely mandate a particular serialization type as part of its usage. The <code>MessageSet</code> interface is simply an iterator over messages with specialized methods for bulk reading and writing to an NIO <code>Channel</code>.
+Leaving the key and payload mostly opaque is the right decision: there is a great deal of progress being made on serialization libraries right now, and any particular choice is unlikely to be right for all uses. But to facilitate interoperability two attribute bits are defined as a serialization selector:
+<ul>
+  <li>0 and 1 specify two payload encodings (text and avro-binary); key format is unspecified.</li>
+  <li>2 specifies that the key must be a JSON object with a property "t" containing a
+<a href="http://www.iana.org/assignments/media-types/media-types.xhtml">media-type</a> string
+registered with IANA.  For example, key <pre>  {"t":"application/cbor"}</pre> specifies that the
+payload is serialized using Concise Binary Object Representation, RFC 7049. The JSON object in key
+may contain an arbitrary set of additional properties.  Using media-type allows payloads of any
+registered format (e.g., image/jpeg, application/pdf) to be specified.</li>
+  <li>3 is reserved; key and payload formats are unspecified.</li>
+</ul>
+
+The <code>MessageSet</code> interface is simply an iterator over messages with specialized methods for bulk reading and writing to an NIO <code>Channel</code>.
 
 <h3><a id="messageformat" href="#messageformat">5.4 Message Format</a></h3>
 
@@ -165,10 +177,16 @@ <h3><a id="messageformat" href="#messageformat">5.4 Message Format</a></h3>
      *      1 : gzip
      *      2 : snappy
      *      3 : lz4
+     *      4~7 : reserved
      *    bit 3 : Timestamp type
      *      0 : create time
      *      1 : log append time
-     *    bit 4 ~ 7 : reserved
+     *    bit 4 ~ 5 : Serialization
+     *      0 : key: opaque, payload: text/plain
+     *      1 : key: opaque, payload: avro-binary
+     *      2 : key: json object, payload: media-type specified by property "t"
+     *      3 : reserved (key: opaque, payload: opaque)
+     *    bit 6 ~ 7 : reserved
      * 4. (Optional) 8 byte timestamp only if "magic" identifier is greater than 0
      * 5. 4 byte key length, containing length K
      * 6. K byte key
@@ -195,8 +213,8 @@ <h3><a id="log" href="#log">5.5 Log</a></h3>
 timestamp      : 8 bytes (Only exists when magic value is greater than zero)
 key length     : 4 bytes
 key            : K bytes
-value length   : 4 bytes
-value          : V bytes
+payload length : 4 bytes
+payload        : V bytes
 </pre>
 <p>
 The use of the message offset as the message id is unusual. Our original idea was to use a GUID generated by the producer, and maintain a mapping from GUID to offset on each broker. But since a consumer must maintain an ID for each server, the global uniqueness of the GUID provides no value. Furthermore the complexity of maintaining the mapping from a random id to an offset requires a heavy weight index structure which must be synchronized with disk, essentially requiring a full persistent random-access data structure. Thus to simplify the lookup structure we decided to use a simple per-partition atomic counter which could be coupled with the partition id and node id to uniquely identify a message; this makes the lookup structure simpler, though multiple seeks per consumer request are still likely. However once we settled on a counter, the jump to directly using the offset seemed natural&mdash;both after all are monotonically increasing integers unique to a partition. Since the offset is hidden from the consumer API this decision is ultimately an implementation detail and we went with the more efficient approach.
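
For readers following the attribute layout proposed in the diff above, here is a minimal Python sketch of how a client might unpack the attributes byte. It assumes the bit ranges given in the proposed 5.4 comment (bits 0 ~ 2 compression, bit 3 timestamp type, bits 4 ~ 5 serialization). The names decode_attributes and Attributes are hypothetical and not part of Kafka, and codec 0 meaning "no compression" is taken from the existing Kafka documentation rather than from the lines shown in this diff.

```python
from typing import NamedTuple

# Bit layout of the attributes byte as proposed in the diff (an assumption,
# not necessarily what any released Kafka version uses).
COMPRESSION_MASK = 0b0000_0111      # bits 0-2
TIMESTAMP_TYPE_MASK = 0b0000_1000   # bit 3
SERIALIZATION_MASK = 0b0011_0000    # bits 4-5
SERIALIZATION_SHIFT = 4

# Codecs 1-3 appear in the diff; 0 ("none") is from the existing docs.
COMPRESSION_NAMES = {0: "none", 1: "gzip", 2: "snappy", 3: "lz4"}


class Attributes(NamedTuple):
    """Hypothetical container for the decoded fields."""
    compression: str
    timestamp_type: str
    serialization: int  # raw selector value, 0-3


def decode_attributes(attributes: int) -> Attributes:
    """Split one attributes byte into its three proposed fields."""
    if not 0 <= attributes <= 0xFF:
        raise ValueError("attributes must fit in one byte")
    codec = attributes & COMPRESSION_MASK
    # Codec values 4-7 are marked reserved in the proposed layout.
    compression = COMPRESSION_NAMES.get(codec, "reserved")
    timestamp_type = (
        "log append time" if attributes & TIMESTAMP_TYPE_MASK else "create time"
    )
    serialization = (attributes & SERIALIZATION_MASK) >> SERIALIZATION_SHIFT
    return Attributes(compression, timestamp_type, serialization)


if __name__ == "__main__":
    # gzip, log append time, serialization selector 2
    print(decode_attributes(0b0010_1001))
```

Bits 6 ~ 7 are ignored in this sketch because the proposal leaves them reserved.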


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Message format needs to identify serializer
> -------------------------------------------
>
>                 Key: KAFKA-3744
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3744
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: David Kay
>            Priority: Minor
>
> https://issues.apache.org/jira/browse/KAFKA-3698 was recently resolved with https://github.com/apache/kafka/commit/27a19b964af35390d78e1b3b50bc03d23327f4d0.
> But the Kafka documentation on message formats needs to be more explicit for new users. Section 1.3, Step 4 says "Send some messages" and takes lines of text from the command line. The beginner's guide (http://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign), slide 104, says:
> {noformat}
>    Kafka does not care about data format of msg payload
>    Up to developer to handle serialization/deserialization
>       Common choices: Avro, JSON
> {noformat}
> If one producer sends lines of console text, another producer sends Avro, a third producer sends JSON, and a fourth sends CBOR, how does the consumer identify which deserializer to use for the payload?  The commit includes an opaque K byte Key that could potentially include a codec identifier, but provides no guidance on how to use it:
> {quote}
> "Leaving the key and value opaque is the right decision: there is a great deal of progress being made on serialization libraries right now, and any particular choice is unlikely to be right for all uses. Needless to say a particular application using Kafka would likely mandate a particular serialization type as part of its usage."
> {quote}
> Mandating any particular serialization is as unrealistic as mandating a single mime-type for all web content.  There must be a way to signal the serialization used to produce this message's V byte payload, and documenting the existence of even a rudimentary codec registry with a few values (text, Avro, JSON, CBOR) would establish the pattern to be used for future serialization libraries.
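
To make the question in the issue concrete, the following Python sketch shows one way a consumer could pick a payload format from the serialization selector proposed in pull request #1419. It is an illustration under assumptions, not Kafka code: payload_media_type is a hypothetical helper, the selector values 0 to 3 follow the proposed attribute bits 4 ~ 5, and "avro-binary" is the label used in the proposal rather than a registered IANA media type.

```python
import json
from typing import Optional


def payload_media_type(selector: int, key: Optional[bytes]) -> Optional[str]:
    """Return the payload format implied by the proposed selector.

    Hypothetical helper: selector is the 2-bit serialization value and
    key is the raw message key. Returns None when the format is unspecified.
    """
    if selector == 0:
        return "text/plain"
    if selector == 1:
        return "avro-binary"  # label from the proposal, not an IANA type
    if selector == 2:
        # The key must be a JSON object whose "t" property names the type.
        if key is None:
            raise ValueError("selector 2 requires a key")
        try:
            doc = json.loads(key)
        except ValueError as exc:  # covers JSON and byte-decoding errors
            raise ValueError("selector 2 requires a JSON key") from exc
        if not isinstance(doc, dict) or not isinstance(doc.get("t"), str):
            raise ValueError('key must be a JSON object with a string "t"')
        return doc["t"]
    if selector == 3:
        return None  # reserved: key and payload formats are unspecified
    raise ValueError("selector must be in the range 0..3")


if __name__ == "__main__":
    # Extra properties in the key are allowed by the proposal and ignored here.
    print(payload_media_type(2, b'{"t":"application/cbor","id":7}'))
```

A consumer could then map the returned string to a deserializer of its choice; that mapping is left out because the proposal does not define one.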



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
