avro-dev mailing list archives

From "Ryan Blue (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
Date Tue, 28 Jun 2016 21:18:57 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353737#comment-15353737 ]

Ryan Blue commented on AVRO-1704:

I agree that the current interface is wide. I think we should have the datum reuse methods,
even though that doubles the API. I think we definitely want the ByteBuffer methods. Do you
think we don't need the InputStream methods? In the pull request there are also byte array
methods, but it's easy for callers to use ByteBuffer instead.

I like having the interface so that alternative implementations can be independent. There's
no guarantee that Avro's base class is useful to implementers and I don't see a need to force
people to inherit from an Avro class when it may not make sense. There's an optional base
class for convenience, so I think the benefits outweigh the cost.

+1 for getting rid of the performance pitfalls. I think we just need to find a reusable ByteArrayInputStream
and make sure we can change the buffer list in ByteBufferInputStream. I'll look into it.
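
As a rough illustration of what a reusable ByteArrayInputStream could look like, here is a
minimal sketch. The class name ReusableByteArrayInputStream and the setBuffer method are
hypothetical and are not part of Avro or the pull request; the point is only that the stream
can be re-targeted at a new array instead of being reallocated for every message.

```java
import java.io.InputStream;

// Hypothetical helper, not an Avro class: an InputStream over a byte array
// that can be pointed at a new array without allocating a new stream object.
class ReusableByteArrayInputStream extends InputStream {
    private byte[] buf = new byte[0];
    private int pos = 0;
    private int limit = 0;

    // Re-target this stream at bytes[offset, offset + length); returns this for chaining.
    ReusableByteArrayInputStream setBuffer(byte[] bytes, int offset, int length) {
        this.buf = bytes;
        this.pos = offset;
        this.limit = offset + length;
        return this;
    }

    @Override
    public int read() {
        // Return the next byte as an unsigned value, or -1 at the end of the range.
        return pos < limit ? (buf[pos++] & 0xFF) : -1;
    }

    @Override
    public int read(byte[] dest, int off, int len) {
        if (len == 0) {
            return 0;
        }
        if (pos >= limit) {
            return -1;
        }
        int n = Math.min(len, limit - pos);
        System.arraycopy(buf, pos, dest, off, n);
        pos += n;
        return n;
    }

    @Override
    public int available() {
        return limit - pos;
    }
}
```

A decoder could hold one such stream and call setBuffer for each incoming payload. The
instance is not thread-safe, so it would need to be kept per thread or per decoder.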

For thread safety we can just make the reused state thread-local, as you suggest. Right now
the Specific methods use a thread-local DatumEncoder/DatumDecoder. Do you think the DatumEncoder
implementations should be thread-safe?
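
To make the thread-local idea concrete, here is a small sketch under stated assumptions. The
class ThreadLocalEncoderSketch is hypothetical, and writing UTF-8 bytes stands in for real
Avro binary encoding; only the pattern of keeping the reused buffer in a ThreadLocal is the
point.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the DatumEncoder from the pull request.
class ThreadLocalEncoderSketch {
    // Each thread gets its own reusable buffer, so callers need no locking.
    private static final ThreadLocal<ByteArrayOutputStream> BUFFER =
        ThreadLocal.withInitial(ByteArrayOutputStream::new);

    // Stand-in for encoding a datum: writes the UTF-8 bytes of the string.
    static byte[] encode(String datum) {
        ByteArrayOutputStream out = BUFFER.get();
        out.reset(); // discard bytes left over from the previous call on this thread
        byte[] bytes = datum.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
        return out.toByteArray(); // copy, so the result does not alias the shared buffer
    }
}
```

With this shape the encode method is safe to call from several threads, at the cost of one
buffer per thread that stays allocated for as long as the thread lives.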

I think we do need the raw format. Right now there are a lot of systems already serializing
Avro records in the equivalent of the raw format, so I would like to have an Avro class that
helps them move to the new spec. Also, if the schema is fixed, then there's no need for 10 extra
bytes per payload, so the raw format is independently useful. For example, I use the raw format
to store JSON payloads. The schema won't change, and Avro is much smaller and faster.
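
The size difference can be sketched as follows. This assumes the 10 extra bytes are a 2-byte
marker (C3 01) followed by an 8-byte little-endian schema fingerprint, which is the layout
Avro's single-object encoding specifies; the comment above only states the total. The class
and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch comparing a framed payload with a raw one.
class MessageFramingSketch {
    // Assumed header: 2-byte marker + 8-byte schema fingerprint = 10 bytes.
    static final int HEADER_SIZE = 10;

    // Prefix an already-encoded datum with the assumed 10-byte header.
    static byte[] framed(long fingerprint, byte[] datum) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_SIZE + datum.length)
            .order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) 0xC3).put((byte) 0x01); // format marker and version
        buf.putLong(fingerprint);              // fingerprint, little-endian
        buf.put(datum);                        // the encoded datum itself
        return buf.array();
    }

    // Raw format: just the encoded datum, with no header at all.
    static byte[] raw(byte[] datum) {
        return datum.clone();
    }
}
```

For small records the header can be a noticeable fraction of each payload, which is why
skipping it makes sense when the schema is known to be fixed.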

> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>            Assignee: Niels Basjes
>         Attachments: AVRO-1704-2016-05-03-Unfinished.patch, AVRO-1704-20160410.patch
> I'm currently using the Datafile format for encoding messages that are written to Kafka
> and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync markers and other
> metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, meaning that
> I can read and write data with minimal effort across the various languages in use in my organization.
> If there was a standardized format for encoding single values that was optimized for out-of-band
> schema transfer, I would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode datums in this
> format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums.
> The reader would decode the fingerprint and ask its SchemaStore to return the corresponding
> writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed library users
> to inject custom backends. A simple, file system based one could be provided out of the box.
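
The reader and SchemaStore interaction described in the issue could be sketched roughly like
this. Everything here is hypothetical: the type names, the use of a JSON string for the
schema, and the simplified header (1-byte version, 1-byte fingerprint type, 8-byte big-endian
fingerprint, no metadata map) are assumptions made for illustration, not the proposed wire
format.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical pluggable backend: maps a schema fingerprint to a writer's schema.
interface SchemaStoreSketch {
    // Returns the schema JSON for the fingerprint, or null if it is unknown.
    String findByFingerprint(long fingerprint);
}

// Simple in-memory backend; a file system or registry backend would implement the same interface.
class InMemorySchemaStore implements SchemaStoreSketch {
    private final Map<Long, String> schemas = new HashMap<>();

    void add(long fingerprint, String schemaJson) {
        schemas.put(fingerprint, schemaJson);
    }

    @Override
    public String findByFingerprint(long fingerprint) {
        return schemas.get(fingerprint);
    }
}

// Hypothetical reader: extracts the fingerprint and asks the store for the writer's schema.
class MessageReaderSketch {
    // Assumed header: version (1 byte) + fingerprint type (1 byte) + fingerprint (8 bytes).
    private static final int HEADER_SIZE = 10;
    private final SchemaStoreSketch store;

    MessageReaderSketch(SchemaStoreSketch store) {
        this.store = store;
    }

    String writerSchemaFor(byte[] message) {
        if (message.length < HEADER_SIZE) {
            throw new IllegalArgumentException("message shorter than header");
        }
        // ByteBuffer reads big-endian by default; the fingerprint starts at offset 2.
        long fingerprint = ByteBuffer.wrap(message).getLong(2);
        return store.findByFingerprint(fingerprint);
    }
}
```

A real reader would go on to decode the datum with the returned writer's schema; the sketch
stops at the lookup because that is the part the SchemaStore abstraction is responsible for.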

This message was sent by Atlassian JIRA
