avro-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: serialization stability when using Avro objects as 'headers' at the start of a longer stream
Date Wed, 15 Jan 2014 18:43:22 GMT
Unless you're certain that the header schema will never change and/or
that the reading and writing code will always have the same exact
version (i.e., data will not be persisted or transmitted over the
network) I would suggest that you also include some kind of version or
magic number at the start of the stream to permit you to evolve the
format of the header.

For example, a simple approach is to make the initial four bytes of
MyFormat version 1 something like ['M','F','0','1'].  Then, in your
code, you might have a table like:

static final Schema[] SCHEMA_VERSIONS = { MyHeaderSchema };

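When you read a stream, you use the version in its first four bytes to
find its schema in this table.

Then, when you modify the header schema, you bump the version and keep
the old schema in the table alongside the new one.  Streams written
with either version remain readable, which permits you to evolve the
header schema.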
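
A minimal sketch of the lookup described above, under some assumptions
of mine: the two digits after 'M','F' are read as a decimal version,
version N maps to table index N-1, and plain JSON strings stand in for
org.apache.avro.Schema objects so the example needs no Avro dependency.
The class name, method names, and the "bodyLength" field are
hypothetical.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class HeaderVersions {
    // Hypothetical table, indexed by (version - 1). In real code these
    // entries would be org.apache.avro.Schema objects, not JSON strings.
    static final String[] SCHEMA_VERSIONS = {
        "{\"type\":\"record\",\"name\":\"MyHeader\",\"fields\":"
            + "[{\"name\":\"bodyLength\",\"type\":\"long\"}]}"
    };

    /** Consumes the 4-byte prefix 'M','F',digit,digit and returns the version. */
    static int readVersion(InputStream in) {
        byte[] prefix = new byte[4];
        try {
            // DataInputStream does not buffer, so the stream is left
            // positioned at the first byte of the Avro header.
            new DataInputStream(in).readFully(prefix);
        } catch (IOException e) {
            throw new UncheckedIOException("stream too short for a MyFormat prefix", e);
        }
        if (prefix[0] != 'M' || prefix[1] != 'F') {
            throw new IllegalArgumentException("not a MyFormat stream");
        }
        if (!isDigit(prefix[2]) || !isDigit(prefix[3])) {
            throw new IllegalArgumentException("malformed version digits");
        }
        int version = (prefix[2] - '0') * 10 + (prefix[3] - '0');
        if (version < 1 || version > SCHEMA_VERSIONS.length) {
            throw new IllegalArgumentException("unsupported MyFormat version " + version);
        }
        return version;
    }

    /** Returns the writer's header schema for a version accepted by readVersion. */
    static String schemaFor(int version) {
        return SCHEMA_VERSIONS[version - 1];
    }

    private static boolean isDigit(byte b) {
        return b >= '0' && b <= '9';
    }
}
```

The schema returned by schemaFor would then be handed to the Avro
decoder as the writer's schema for the header, with the current header
schema as the reader's schema.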

There are lots of other ways to do this with various tradeoffs.  The
schemas could be stored in a database; you might use the Schema's
fingerprint instead of a version number; you could even put the entire
schema at the beginning of every stream.  Regardless, for any
non-ephemeral format, it's best to have the first few bytes identify
the format.
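
The fingerprint variant might look roughly like the sketch below.  It
is an illustration under assumptions, not Avro's API: it hashes the
schema's JSON text with SHA-256 from the JDK, and the class and method
names are hypothetical.  In real code you would more likely fingerprint
the schema's canonical form (Avro's Java library has a
SchemaNormalization class for this), so that whitespace or attribute
order does not change the fingerprint.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

final class SchemaRegistrySketch {
    // Known header schemas keyed by fingerprint. ByteBuffer is used as the
    // key because its equals/hashCode compare contents, unlike byte[].
    private final Map<ByteBuffer, String> byFingerprint = new HashMap<>();

    /** SHA-256 over the schema's JSON text (an assumption of this sketch). */
    static byte[] fingerprint(String schemaJson) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(schemaJson.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    /** Remembers a schema and returns the bytes a writer would put at the start of a stream. */
    byte[] register(String schemaJson) {
        byte[] fp = fingerprint(schemaJson);
        byFingerprint.put(ByteBuffer.wrap(fp.clone()), schemaJson);
        return fp;
    }

    /** Returns the schema for a fingerprint read from a stream, or null if unknown. */
    String lookup(byte[] fp) {
        return byFingerprint.get(ByteBuffer.wrap(fp));
    }
}
```

Compared with a version table, nothing has to be numbered by hand, at
the cost of a longer prefix (32 bytes here) and the need to keep every
schema you have ever written with available to readers.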


On Wed, Jan 15, 2014 at 12:15 AM, Sid Shetye <sid314@outlook.com> wrote:
> From a deserialization stability perspective, how safe is it to have an
> Avro serialized object at the start of a byte stream? Let's assume the
> rest of the stream, after this Avro serialized object, is filled with
> application layer data which can be anything from zero bytes to a few
> hundred megabytes.
> Essentially using the Avro object as a header and the "body" being a
> byte stream. To illustrate via a made-up case:
> Offset - Data
> ==========
> 0x0000 - 1st byte of MyAvroHeader (serialized by Avro)
> ...
> 0x001F - Last byte of MyAvroHeader (serialized by Avro)
> 0x0020 - 1st byte of MyAppStream
> ...              // (bytes/offsets continue till end-of-stream is reached)
> I did a quick and simple serialization-only test (no IPC/RPC) using the
> C# version of Avro and this seems to work well. However, I wanted to
> hear from others if there are some issues with this approach.
> Regards
> Sid
