avro-dev mailing list archives

From "Ryan Blue (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
Date Sun, 17 Jul 2016 22:46:20 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381561#comment-15381561 ]

Ryan Blue commented on AVRO-1704:
---------------------------------

I think this should be abstract. The format that we're adding solves one set of uses, but
the utility methods have value beyond that. Encoding a single Avro record is fairly common,
but the implementations vary widely in quality because it is difficult to find the right setup
of DatumWriter, BinaryEncoder, and ByteArrayOutputStream. Simplifying and improving applications
that already do this is a good thing. And some of those uses, like the case I mentioned where
we're embedding Avro in Parquet records, don't need the header or schema at all because that's
defined in the file metadata.
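For reference, the setup described above usually looks something like this in Java. This is a minimal sketch using the public Avro API; the schema and record here are illustrative, not from the patch:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class SingleRecordEncode {
  // Encode one record to bytes, with no file header or schema attached.
  public static byte[] encode(Schema schema, GenericRecord record) throws IOException {
    GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // Passing null creates a fresh encoder; passing a previous BinaryEncoder
    // instead lets the factory reuse its buffer.
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(record, encoder);
    encoder.flush(); // the encoder buffers internally, so flush before reading the bytes
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"M\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"}]}");
    GenericRecord record = new GenericData.Record(schema);
    record.put("id", 42);
    byte[] bytes = encode(schema, record);
    System.out.println(bytes.length);
  }
}
```

Getting the flush and encoder-reuse details right is exactly the part that varies between hand-rolled implementations, which is what the utility methods would standardize.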

The abstraction is also useful for transitioning to the format we're defining here. The normal
way to encode messages in Kafka is the 8-byte fingerprint followed by the encoded message
payload. With the abstraction, you can write a decoder that checks for the header and then
deserializes, or assumes the old format if the header is missing. That would enable rolling
upgrades using the same Kafka topics, rather than needing a hard transition.
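The routing decision itself is simple byte inspection. A sketch of that check, assuming a hypothetical two-byte header marker (the actual header bytes are whatever the new format ends up defining):

```java
import java.util.Arrays;

public class FormatSniffer {
  // Illustrative marker bytes only; not taken from the spec under discussion.
  static final byte[] HEADER = new byte[] {(byte) 0xC3, (byte) 0x01};

  /** True if the payload starts with the new format's header. */
  public static boolean hasHeader(byte[] payload) {
    return payload.length >= HEADER.length
        && Arrays.equals(Arrays.copyOfRange(payload, 0, HEADER.length), HEADER);
  }

  /** Pick a decode path: the legacy layout is a bare 8-byte fingerprint, then the body. */
  public static String route(byte[] payload) {
    return hasHeader(payload) ? "new-format" : "legacy-fingerprint";
  }
}
```

A dual-format MessageDecoder built on this check can consume old and new messages from the same topic, which is what makes the rolling upgrade possible.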

I would also include the abstraction in case we want to change or introduce a new format later.

bq. I also worry that names like BinaryDatumDecoder

I've pushed a new commit that moves the classes to org.apache.avro.message and renames them
to MessageEncoder and MessageDecoder. I used "encoder" and "decoder" rather than "writer" and
"reader" to contrast with DatumWriter and DatumReader, since there is little difference between
a datum and a message (a message is just a datum encoded by itself).

bq. Perhaps [the reusable i/o streams] should go in the util package so they can be used more
widely?

I've moved them there. I avoided it before so that they weren't added to the public API, but
I think it's fine to make them available.

bq. We might also add utilities for generic & reflect, like, model#getMessageWriter(Schema)?

I looked at this, but then the GenericData classes would have both createDatumWriter and getMessageWriter,
which looks confusing to me. Keeping the MessageEncoder above the level of the data models helps
separate the DatumWriter from the MessageEncoder.

If we want to make instantiating these easier, then maybe a builder would be more appropriate.
That would allow us to pass multiple writer schemas to the MessageDecoder.
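The shape I have in mind is something like the following. None of these names exist in Avro today; this is only a sketch of how a builder could collect several writer schemas before constructing the decoder (a String stands in for Schema to keep the sketch self-contained):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical builder; illustrates registering multiple writer schemas,
// keyed by fingerprint, before a MessageDecoder is constructed.
public class MessageDecoderBuilder {
  private final Map<Long, String> schemasByFingerprint = new HashMap<>();

  // Chainable, so any number of known writer schemas can be added.
  public MessageDecoderBuilder addWriterSchema(long fingerprint, String schemaJson) {
    schemasByFingerprint.put(fingerprint, schemaJson);
    return this;
  }

  public int schemaCount() {
    return schemasByFingerprint.size();
  }
}
```

A builder also leaves room for later options (a SchemaStore, a default read schema) without multiplying factory methods on the data-model classes.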

> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>            Assignee: Niels Basjes
>             Fix For: 1.9.0, 1.8.3
>
>         Attachments: AVRO-1704-2016-05-03-Unfinished.patch, AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are written to Kafka
and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync markers and other
metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, meaning that
I can read and write data with minimal effort across the various languages in use in my organization.
If there were a standardized format for encoding single values, optimized for out-of-band
schema transfer, I would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode datums in this
format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums.
The reader would decode the fingerprint and ask its SchemaStore to return the corresponding
writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed library users
to inject custom backends. A simple, file system based one could be provided out of the box.
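The five pieces listed in the quoted proposal could be laid out as below. All field widths here are illustrative (the proposal does not fix byte sizes), and the empty metadata map is written as a zero count:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ProposedFrame {
  // Writes the proposal's five pieces in order; widths are assumptions, not spec.
  public static byte[] frame(int version, int fingerprintType, byte[] fingerprint,
                             byte[] encodedDatum) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buf);
    out.writeByte(version);          // 1. format version number
    out.writeByte(fingerprintType);  // 2. fingerprint type id (Rabin, MD5, SHA256, ...)
    out.write(fingerprint);          // 3. the fingerprint itself (width set by the type)
    out.writeByte(0);                // 4. optional metadata map, empty here (0 entries)
    out.write(encodedDatum);         // 5. the encoded datum
    out.flush();
    return buf.toByteArray();
  }
}
```

A MessageReader would reverse this: read the fingerprint type and bytes, ask its SchemaStore for the matching writer schema, then decode the datum.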



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
