avro-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: A case for adding revision field to Avro schema
Date Tue, 21 Sep 2010 17:55:40 GMT
On 09/21/2010 05:18 AM, Thiruvalluvan M. G. wrote:
> Here is a design that would improve things a bit more. Instead of
> serializing the object against its actual schema, let's say the application
> serializes against a union schema in which the object type's schema is a
> branch. As the application evolves, the application simply adds a branch to
> the union.

Where would this union be stored?  Is it only stored in the application, 
or is it stored with the data?  I think it would be safest to somehow 
store it with the dataset, not in the application.

> While reading the object, the application expects one branch but
> the serialized object might be using another branch. As long as the branches
> "match", Avro would resolve correctly. The current Java generic writer can
> correctly pick the branch as long as the object's schema is one of the
> branches. The nice thing about this improved design is that there is no
> need to store a separate schema "pointer" along with the object. The
> "union-index" essentially acts as the pointer and it is internal to Avro.

It sounds like perhaps you're trying to optimize the size of the pointer 
from each stored instance to its schema.  Is that correct?  If so, then 
one might simply use a table for this.  The application stores 
<pointer,record> pairs, but the pointers need not be 16-byte checksums; 
they could be variable-length integers, starting from zero, that, for 
most applications, would always fit in a single byte.
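To make the size point concrete, here is a minimal Python sketch of such a table (the names are invented for illustration, not an Avro API), with pointers encoded the way Avro encodes longs: zigzag, then base-128 varint, so indices 0..63 occupy a single byte.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed int as Avro encodes longs: zigzag, then a
    base-128 varint.  Indices 0..63 fit in one byte."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes -> small codes
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)  # more bytes follow
        else:
            out.append(b)
            return bytes(out)

class SchemaTable:
    """Hypothetical table assigning each distinct schema a small,
    sequential integer pointer (illustrative, not an Avro API)."""
    def __init__(self):
        self._ids = {}      # schema JSON text -> index
        self._schemas = []  # index -> schema JSON text

    def pointer(self, schema_json: str) -> int:
        if schema_json not in self._ids:
            self._ids[schema_json] = len(self._schemas)
            self._schemas.append(schema_json)
        return self._ids[schema_json]
```

With this, the per-datum overhead for the first 64 distinct schemas is a single byte, versus 16 bytes for a checksum-based pointer.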

If schemas are stored with the dataset, then they could be stored as either:
  - the standalone single schema for every item in the dataset, which 
happens to be a union schema that's managed in a particular way, adding 
a new entry to the end each time an instance of a new schema is written; or
  - a table of schemas, whose indices are used as pointers in each 
datum, with entries added when no existing entry matches a datum to be 
written.
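As a sketch of the two layouts (the record schemas here are invented for illustration), expressed as Avro schema JSON held in Python structures:

```python
# Layout 1: the dataset's schema is one top-level union; a new branch is
# appended each time a record with a previously unseen schema is written.
union_schema = [
    {"type": "record", "name": "UserV0",
     "fields": [{"name": "id", "type": "long"}]},
    {"type": "record", "name": "UserV1",
     "fields": [{"name": "id", "type": "long"},
                {"name": "email", "type": "string"}]},
]

# Layout 2: the same schemas kept as a table; each stored datum is
# prefixed with its index into the table.
schema_table = {i: s for i, s in enumerate(union_schema)}
```

The pointer stored with a datum is the same small integer either way: the union branch index in layout 1 is the table index in layout 2.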

The two are isomorphic.  The former uses more Avro logic but feels more 
fragile.  It's not really an arbitrary schema, but a union that takes 
advantage of the way that unions are serialized.  The latter feels to me 
like a clearer description of a dataset.  In either case the application 
must manage the table of schemas.  The only operation that's simplified 
is that the top-level union dispatch at read and possibly write would 
use Avro logic instead of application logic.  At write time you might 
even be tempted to bypass Avro logic, since, in maintaining the union, 
you'd know the branch already, and searching for the right branch might 
be more costly.

> But there is one problem. As per the Avro specification, in order to
> "match", two schemas of the same type should have the same name. But two
> schemas with the same type and name cannot be branches within a union. Thus
> the design above will not work.

The problem with multiple union branches of the same name only arises at 
write time, not at read time.  So, if we allowed multiple branches of 
the same name in a top-level union at read time, then this might work.

A way to address this might be through aliases.  If, in the union, each 
branch but the last has a versioned name, i.e., the union is 
["r0", "r1", .., "r"], then writing would work.  If "r" then has aliases 
of ["r0", "r1", ..], then, at read time, the union would be rewritten as 
["r", "r", ...], but where each branch has a different definition. 
Currently this would fail due to the duplicate names, but if we changed 
it so that, in the context of alias rewrites while reading, we permit 
duplicate names in a top-level union, then this could work as desired.
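A sketch of what such a union might look like, written as Avro schema JSON in a Python list (the record fields are invented; writing against this union is legal today since the branch names are distinct, and only the read-time alias rewrite would need the relaxed duplicate-name rule described above):

```python
# Every branch but the last carries a versioned name; the current
# schema "r" lists those names as aliases so that data written against
# old branches resolves to it at read time.
writer_union = [
    {"type": "record", "name": "r0",
     "fields": [{"name": "id", "type": "long"}]},
    {"type": "record", "name": "r1",
     "fields": [{"name": "id", "type": "long"},
                {"name": "email", "type": "string"}]},
    {"type": "record", "name": "r", "aliases": ["r0", "r1"],
     "fields": [{"name": "id", "type": "long"},
                {"name": "email", "type": "string"},
                {"name": "active", "type": "boolean"}]},
]
```

At read time the alias rewrite would, in effect, replace every branch named "r0" or "r1" with one named "r", each keeping its own field list.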

Doug

