avro-user mailing list archives

From "Jarrad, Ken " <ken.jar...@citi.com>
Subject RE: Alternative to Avro container files for long-term Avro storage
Date Tue, 15 Nov 2016 13:44:00 GMT

My understanding of Avro containers is that they include the schema (self-contained) and the
reader will get the schema from the container.
I use this technique for Kafka, not Avro containers, so I avoid the problem of ‘sealing’
the schema inside the container, but I need to publish the schema for use by others.

Appending a new type of message probably requires duplication of an existing container.
Avro unions are backward compatible when appending a new type.
That allows my Kafka clients to read older messages with newer unions.
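As a rough illustration of why appending is safe: Avro's binary encoding writes a union value as the zero-based position of its branch followed by the value, so a branch added at the end leaves the positions of the existing branches unchanged. The sketch below uses only plain JSON-style schemas; the record names (Order, Refund, Shipment) and the branch_index helper are hypothetical and not part of the Avro API.

```python
# Hypothetical record schemas; names and fields are illustrative only.
order = {"type": "record", "name": "Order",
         "fields": [{"name": "id", "type": "long"}]}
refund = {"type": "record", "name": "Refund",
          "fields": [{"name": "order_id", "type": "long"}]}
shipment = {"type": "record", "name": "Shipment",
            "fields": [{"name": "order_id", "type": "long"}]}

# In Avro's JSON schema syntax a union is simply a list of schemas.
old_union = [order, refund]
new_union = old_union + [shipment]  # append the new type at the end


def branch_index(union, name):
    """Return the zero-based position of the named record in the union."""
    return [schema["name"] for schema in union].index(name)


# Every branch of the old union keeps its position in the new one.
unchanged = all(
    branch_index(old_union, s["name"]) == branch_index(new_union, s["name"])
    for s in old_union
)
print(unchanged, branch_index(new_union, "Shipment"))
```

Inserting the new type anywhere other than the end would shift the later positions, which is why the order of the union matters here.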


From: Josh [mailto:jofo90@gmail.com]
Sent: 15 November 2016 12:46
To: user@avro.apache.org
Subject: Re: Alternative to Avro container files for long-term Avro storage

Hi Ken,

Thanks for the reply - that does sound like a good idea, however I don't think it will work
well for me - as I don't have a fixed number of message types. In my case there could potentially
be new message types added every day and the union could grow to contain hundreds of message
types. It also sounds tricky to manage the union when adding new message types. (i.e. making
sure readers' schemas are updated first)

If there's a nice way to do it, I'd like to find a way that doesn't involve Avro container
files, so that I can maintain a separate Avro schema per message type.


On Tue, Nov 15, 2016 at 12:21 PM, Jarrad, Ken <ken.jarrad@citi.com> wrote:
Josh, I use method createUnion on class org.apache.avro.Schema.

The mixed message types then have the union as their common type and are thus homogeneous.

Yours sincerely,
Ken Jarrad.

From: Josh [mailto:jofo90@gmail.com]
Sent: 15 November 2016 10:24
To: user@avro.apache.org
Subject: Alternative to Avro container files for long-term Avro storage

Hi all,

I am using a typical Avro->Kafka solution where data is serialized to Avro before it gets
written to Kafka and each message is prepended with a schema ID which can be looked up in
my schema repository.

Now, I want to store the data in long-term storage by writing data from Kafka->S3.

I know that the usual way to store Avro in storage is using Avro container files, however
a container file can only contain messages encoded with a single Avro schema. In my case,
the messages may be encoded with different schemas, and I need to retain the order of the
messages (so that they can be replayed into Kafka, in order). Therefore, a single file in
S3 needs to contain messages encoded with different schemas and so I can't use Avro container files.

I was wondering what would be a good solution to this? What format could I use to store my
Avro data, such that a single data file can contain messages encoded with different schemas?
Should I store the messages with a prepended schema ID, similar to what I do in Kafka? In
that case, how could I separate the messages in the file?
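One possible way to separate the messages is to prefix each one with its length as well as its schema ID. The sketch below is only an illustration of that idea, not an Avro or Kafka standard: the frame layout (4-byte big-endian schema ID, 4-byte big-endian payload length, then the payload) and the write_frame and read_frames helpers are assumptions, and the payload bytes are placeholders standing in for Avro-encoded message bodies.

```python
import io
import struct

# Assumed frame layout (not an Avro standard): 4-byte big-endian schema ID,
# 4-byte big-endian payload length, then the payload bytes.
HEADER = struct.Struct(">II")


def write_frame(out, schema_id, payload):
    """Append one framed message to a binary stream."""
    out.write(HEADER.pack(schema_id, len(payload)))
    out.write(payload)


def read_frames(inp):
    """Yield (schema_id, payload) pairs in the order they were written."""
    while True:
        header = inp.read(HEADER.size)
        if not header:
            return  # clean end of file
        if len(header) < HEADER.size:
            raise EOFError("truncated frame header")
        schema_id, length = HEADER.unpack(header)
        payload = inp.read(length)
        if len(payload) < length:
            raise EOFError("truncated frame payload")
        yield schema_id, payload


# Placeholder payloads; in practice these would be Avro-encoded bodies.
messages = [(7, b"\x02"), (12, b"\x06abc"), (7, b"\x04")]

buf = io.BytesIO()  # stands in for a file destined for S3
for schema_id, payload in messages:
    write_frame(buf, schema_id, payload)

buf.seek(0)
replayed = list(read_frames(buf))
print(replayed == messages)
```

Because each frame carries its own schema ID, messages with different schemas can be interleaved in one file and read back in their original order.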

Thanks for any advice,
