avro-dev mailing list archives

From "Ryan Blue (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
Date Sun, 10 Jul 2016 01:31:11 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15369378#comment-15369378 ]

Ryan Blue commented on AVRO-1704:
---------------------------------

[~cutting], I've pushed a couple new commits to the pull request. The changes include:
* Add ReusableByteBufferInputStream and ReusableByteArrayInputStream
* Make the encoder and decoder instances thread-safe
* Remove the thread-local encoder from Specific because the static encoder and decoder are
now thread-safe
* Add tests using generic

That addresses the review feedback other than the question of whether to use an interface
or an abstract class. I think the patch has the best of both options by including both an
interface and an abstract base class (DatumDecoder.BaseDecoder) that implementations can use
to cut down on boilerplate and maintain compatibility. That leaves the choice up to the implementer.
If you have a strong opinion here, I can change it, but I think having both is a good solution.
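A rough sketch of that shape, for illustration only: the DatumDecoder and BaseDecoder names come from the comment above, but the decode signature and the example implementation are hypothetical, not the actual patch API.

```java
import java.io.IOException;
import java.nio.ByteBuffer;

// Sketch only: names follow the discussion above; the signature is
// illustrative, not the actual patch API.
interface DatumDecoder<D> {
  D decode(ByteBuffer buffer, D reuse) throws IOException;

  // Abstract base that supplies a convenience overload, so concrete
  // decoders implement a single method while staying compatible if
  // more shared behavior is added here later.
  abstract class BaseDecoder<D> implements DatumDecoder<D> {
    public D decode(ByteBuffer buffer) throws IOException {
      return decode(buffer, null);
    }
  }
}

// Hypothetical implementation: decodes a single big-endian int.
class IntDecoder extends DatumDecoder.BaseDecoder<Integer> {
  @Override
  public Integer decode(ByteBuffer buffer, Integer reuse) {
    return buffer.getInt();
  }
}
```

Callers can type against the interface while implementers extend the base class, which is the compatibility point being argued for.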

Also, some of the tests are ignored because they don't pass without a modification to the
ResolvingGrammarGenerator. Aliases don't appear to be working. I'm opening another issue with
a patch for it.

> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>            Assignee: Niels Basjes
>             Fix For: 1.9.0, 1.8.3
>
>         Attachments: AVRO-1704-2016-05-03-Unfinished.patch, AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are written to Kafka
and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync markers and other
metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, meaning that
I can read and write data with minimal effort across the various languages in use in my organization.
If there were a standardized format for encoding single values that was optimized for out-of-band
schema transfer, I would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, e.g. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type).
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode datums in this
format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums.
The reader would decode the fingerprint and ask its SchemaStore to return the corresponding
writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed library users
to inject custom backends. A simple, file system based one could be provided out of the box.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
