avro-dev mailing list archives

From "Niels Basjes (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (AVRO-1704) Standardized format for encoding messages with Avro
Date Fri, 11 Mar 2016 13:03:11 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190866#comment-15190866 ]

Niels Basjes edited comment on AVRO-1704 at 3/11/16 1:00 PM:
-------------------------------------------------------------

Thanks for pointing this out. 

My updated proposal for this:
{code}Avro<version><fingerprint><record>{code}
Where:
# "version" = 1 byte indicating the version (or "schema") of the rest of the bytes.
If version == 0x00:
# "fingerprint" = the CRC-64-AVRO of the canonical form of the schema.
# "record" = the record serialized to bytes using the existing serialization system.

I personally do not like these 'chopped' prefixes unless there is a "really good reason to chop
them" (like the length).
Because the project's name is so short, this proposal sticks to using the full name
of the project as the prefix: "Avro" (i.e. these 4 bytes: 0x41, 0x76, 0x72, 0x6F).
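
To make the proposed layout concrete, here is a minimal Python sketch of framing and unframing a message as "Avro" + version + fingerprint + record. The CRC-64-AVRO routine follows the algorithm published in the Avro specification. The function names (fingerprint64, encode_message, decode_message) are hypothetical, and the little-endian byte order for the 8-byte fingerprint is an assumption, since the proposal above does not fix one. The record bytes are taken as already serialized.

```python
import struct

MAGIC = b"Avro"              # 0x41 0x76 0x72 0x6F, the prefix from the proposal
VERSION_0 = 0x00             # the only version value the proposal defines
EMPTY = 0xC15D213AA4D7A795   # CRC-64-AVRO seed from the Avro specification


def _build_table():
    """Precompute the 256-entry lookup table used by CRC-64-AVRO."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            # XOR in EMPTY only when the low bit is set.
            fp = (fp >> 1) ^ (EMPTY & -(fp & 1))
        table.append(fp)
    return table


_FP_TABLE = _build_table()


def fingerprint64(data: bytes) -> int:
    """Return the CRC-64-AVRO fingerprint of data as an unsigned 64-bit int."""
    fp = EMPTY
    for b in data:
        fp = (fp >> 8) ^ _FP_TABLE[(fp ^ b) & 0xFF]
    return fp


def encode_message(canonical_schema: str, record_bytes: bytes) -> bytes:
    """Frame an already-serialized record as Avro<version><fingerprint><record>."""
    fp = fingerprint64(canonical_schema.encode("utf-8"))
    # Assumption: fingerprint written little-endian; the proposal does not say.
    return MAGIC + bytes([VERSION_0]) + struct.pack("<Q", fp) + record_bytes


def decode_message(message: bytes):
    """Split a framed message into (fingerprint, record_bytes)."""
    if len(message) < 13:
        raise ValueError("message shorter than the 13-byte header")
    if message[:4] != MAGIC:
        raise ValueError("missing 'Avro' prefix")
    if message[4] != VERSION_0:
        raise ValueError("unsupported version byte: %#04x" % message[4])
    (fp,) = struct.unpack_from("<Q", message, 5)
    return fp, message[13:]
```

Under these assumptions the header is 13 bytes (4 prefix + 1 version + 8 fingerprint), so a reader can reject unrelated data after inspecting only the first five bytes.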



was (Author: nielsbasjes):
Thanks for pointing this out. 

My updated proposal for this:
{code}"Avro"<version><fingerprint><record>{code}
Where:
# "version" = 1 byte indicating the version (or "schema") of the rest of the bytes.
If version == 0x00:
# "fingerprint" = the CRC-64-AVRO of the canonical form of the schema.
# "record" = the record serialized to bytes using the existing serialization system.

I personally do not like these 'chopped' prefixes unless there is a "really good reason to chop
them" (like the length).
Because the project's name is so short, this proposal sticks to using the full name
of the project as the prefix: "Avro" (i.e. these 4 bytes: 0x41, 0x76, 0x72, 0x6F).


> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>
> I'm currently using the Datafile format for encoding messages that are written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, meaning that I can read and write data with minimal effort across the various languages in use in my organization. If there were a standardized format for encoding single values that was optimized for out-of-band schema transfer, I would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, e.g. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type).
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode datums in this format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums. The reader would decode the fingerprint and ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed library users to inject custom backends. A simple, file system based one could be provided out of the box.
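
The SchemaStore idea in the issue description could look roughly like the following Python sketch. Everything here is hypothetical: the method name find_by_fingerprint, the InMemorySchemaStore backend (standing in for the file system based one mentioned above), and the decode_datum callable, which is a placeholder for whatever datum decoder a language library already provides.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Optional


class SchemaStore(ABC):
    """Abstract lookup from a schema fingerprint to the writer's schema JSON."""

    @abstractmethod
    def find_by_fingerprint(self, fingerprint: int) -> Optional[str]:
        """Return the schema JSON for fingerprint, or None if it is unknown."""


class InMemorySchemaStore(SchemaStore):
    """Simplest possible backend: a dict held in memory."""

    def __init__(self) -> None:
        self._schemas: Dict[int, str] = {}

    def add(self, fingerprint: int, schema_json: str) -> None:
        self._schemas[fingerprint] = schema_json

    def find_by_fingerprint(self, fingerprint: int) -> Optional[str]:
        return self._schemas.get(fingerprint)


class MessageReader:
    """Resolves the writer's schema through a SchemaStore, then decodes."""

    def __init__(self, store: SchemaStore,
                 decode_datum: Callable[[str, bytes], object]) -> None:
        self._store = store
        # Placeholder for the library's real datum decoder (an assumption).
        self._decode_datum = decode_datum

    def read(self, fingerprint: int, payload: bytes) -> object:
        schema_json = self._store.find_by_fingerprint(fingerprint)
        if schema_json is None:
            raise LookupError("no schema registered for fingerprint %d" % fingerprint)
        return self._decode_datum(schema_json, payload)
```

Because the reader depends only on the abstract interface, a registry-backed or file-backed store could be swapped in without touching the decoding path.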



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
