avro-dev mailing list archives

From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: Avro union compatibility mode enhancement proposal
Date Sat, 16 Apr 2016 20:36:47 GMT
Matthieu, how would that work with enums? I think it may make sense for
enums to add a special catch-all value as well.

rb

On Sat, Apr 16, 2016 at 1:23 PM, Matthieu Monsch <monsch@alum.mit.edu>
wrote:

> > I think that substituting different data for unrecognized branches in a
> > union isn't the way to fix the problem, but I have a proposal for a way
> > to fix it that looks more like you'd expect in the OO example above: by
> > adding the Vehicle class to your read schema.
> >
> > Right now, unions are first resolved by full class name as required by
> > the spec. But after that, we have some additional rules to match
> > schemas. These rules are how we reconcile name differences from
> > situations like writing with a generic class and reading with a
> > specific class. I'm proposing you use a catch-all class (the
> > superclass) with fields that are in all of the union's branches, and we
> > update schema resolution to allow it.
>
>
> That sounds good. The only thing I’d add is to make this catch-all
> behavior explicit in the schema (similar to how field defaults must be
> explicitly added).
> To help fix another common writer evolution issue, we could also add a
> similar catch-all for `enum`s (optional, to be explicitly specified in the
> schema).
> -Matthieu
>
>
>
> > On Apr 12, 2016, at 2:41 PM, Ryan Blue <rblue@netflix.com.INVALID>
> wrote:
> >
> > Yacine,
> >
> > Thanks for the extra information. Sorry for my delay in replying; I
> > wanted time to think about your suggestion.
> >
> > I see what you mean that you can think of a union as the superclass of
> > its options. The reflect object model has test code that does just that,
> > where classes B and C inherit from A and the schema for A is created as
> > a union of B and C. But I don't think that your suggestion aligns with
> > the expectations of object-oriented design. Maybe that's an easier way
> > to present my concern:
> >
> > Say I have a class, Vehicle, with subclasses Car and Truck. I have
> > applications that work with my dataset, the vehicles that my company
> > owns, and we buy a bus. If I add a Bus class, what normally happens is
> > that an older application can still work with it. A maintenance tracker
> > would call getVehicles and can still get the lastMaintenanceDate for my
> > Bus, even though it doesn't know about buses. But what you suggest is
> > that it is replaced with a default, say null, in cases like this.
> >
> > I think that the problem is that Avro has no equivalent concept of
> > inheritance. There's only one way to represent it for what you need
> > right now, like Matthieu suggested. I think that substituting different
> > data for unrecognized branches in a union isn't the way to fix the
> > problem, but I have a proposal for a way to fix it that looks more like
> > you'd expect in the OO example above: by adding the Vehicle class to
> > your read schema.
> >
> > Right now, unions are first resolved by full class name as required by
> > the spec. But after that, we have some additional rules to match
> > schemas. These rules are how we reconcile name differences from
> > situations like writing with a generic class and reading with a
> > specific class. I'm proposing you use a catch-all class (the
> > superclass) with fields that are in all of the union's branches, and we
> > update schema resolution to allow it. So you'd have...
> >
> > record Vehicle {
> >  long lastMaintenanceDate;
> > }
> >
> > record Car {
> >  long lastMaintenanceDate;
> >  int maxChildSeats;
> > }
> >
> > record Truck {
> >  long lastMaintenanceDate;
> >  long lastWashedDate;
> > }
> >
> > Write schema: [null, Car, Truck] Read schema: [null, Car, Vehicle]. The
> > catch-all class could provide the behavior you want, without silently
> > giving you incorrect data.
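Ryan's catch-all resolution can be illustrated with a small simulation (a sketch only; `resolve_branch` and the subset-of-fields rule are illustrative assumptions, not Avro's actual resolution code):

```python
# Sketch (assumption): simulate the proposed "catch-all" union resolution.
# A written branch resolves to a reader branch by exact name; otherwise it
# falls back to a reader branch whose fields are a subset of the written
# branch's fields (the "superclass" record, e.g. Vehicle).

def resolve_branch(written_name, written_fields, reader_branches):
    """reader_branches: dict of branch name -> set of field names."""
    # 1. Standard resolution: match by full name (per the Avro spec).
    if written_name in reader_branches:
        return written_name
    # 2. Proposed fallback: a catch-all branch whose fields all appear
    #    in the written record, so its values can be read safely.
    for name, fields in reader_branches.items():
        if fields and fields <= written_fields:
            return name
    raise ValueError(f"no matching branch for {written_name!r}")

reader = {
    "Car": {"lastMaintenanceDate", "maxChildSeats"},
    "Vehicle": {"lastMaintenanceDate"},  # catch-all superclass
}

# Truck is unknown to the reader but resolves to Vehicle.
print(resolve_branch("Truck", {"lastMaintenanceDate", "lastWashedDate"}, reader))
```

Under this rule, the reader still receives the fields it can understand (here, `lastMaintenanceDate`) instead of an error or a silently substituted default.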
> >
> > Does that sound like a reasonable solution for your use case?
> >
> > rb
> >
> > On Tue, Mar 29, 2016 at 10:03 PM, Matthieu Monsch <monsch@alum.mit.edu>
> > wrote:
> >
> >> Hi Yacine,
> >>
> >> I believe Ryan was mentioning that if you start from a schema `[“null”,
> >> “Car”]` then rather than add a new bus branch to the union, you could
> >> update the car’s schema to include new bus fields. For example (using
> >> IDL notation):
> >>
> >>> // Initial approach
> >>> // Evolved to union { null, Car, Bus }
> >>> record Car {
> >>>  string vin;
> >>> }
> >>> record Bus {
> >>>  string vin;
> >>>  int capacity;
> >>> }
> >>>
> >>> // Alternative approach
> >>> // Here evolution wouldn't change the union { null, Vehicle }
> >>> // Note also that this is actually a compatible evolution from the
> >>> // first approach, even under current rules (using aliases for name
> >>> // changes).
> >>> record Vehicle {
> >>>  string vin; // (Would be optional if not all branches had it.)
> >>>  union { null, int } capacity; // The new field is added here.
> >>> }
> >>
> >> Pushing this further, you could also directly start upfront with just a
> >> “Vehicle” record schema with an optional `car` field and add new
> >> optional fields as necessary (for example a `bus` field). This gives
> >> you much greater flexibility for evolving readers and writers
> >> separately (but you lose the ability to enforce that at most one field
> >> is ever present in the record).
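The record-level evolution described above works because of Avro's standard record resolution: the reader ignores writer fields it doesn't declare and fills its own missing fields from defaults. A minimal simulation (illustrative names only; real Avro readers apply this during decoding):

```python
# Sketch (assumption): why pushing evolution into the record works. The
# reader projects the writer's record onto its own fields, ignoring new
# fields and filling absent ones from defaults -- a simulation of standard
# Avro record resolution, not the library's code.

def project(record, reader_fields):
    """reader_fields: dict of field name -> default value."""
    out = {}
    for name, default in reader_fields.items():
        out[name] = record.get(name, default)  # missing field -> default
    return out  # extra writer fields (e.g. `capacity`) are ignored

old_reader = {"vin": None}                  # original Vehicle schema
new_datum = {"vin": "X1", "capacity": 40}   # written by the evolved schema
print(project(new_datum, old_reader))       # the old reader still works
```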
> >>
> >> I agree that Avro’s current evolution rules are more geared towards
> >> reader evolution: evolving writers while keeping readers constant is
> >> difficult (namely because they don't support adding branches to unions
> >> or symbols to enums). However, adding a completely new and separate
> >> “compatibility mode” has a high complexity cost, both for implementors
> >> (the separate classes you mention) and users (who must invoke them
> >> specifically). It would be best to keep evolution rules universal.
> >>
> >> Maybe we could extend the logic behind field defaults? Enum schemas
> >> could simply contain a default symbol attribute. For unions, it’s a bit
> >> trickier, but we should be able to get around it with a default branch
> >> attribute which would allow evolution if and only if the newly added
> >> branches can be read as the default branch. Assuming the attribute
> >> names don’t already exist, this would be compatible with the current
> >> behavior.
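A hypothetical default-symbol attribute for enums could behave like this (purely illustrative; `read_enum`, the `default` parameter, and the symbol names are assumptions, not an existing Avro API):

```python
# Sketch (assumption): enum resolution with a hypothetical "default" symbol
# attribute, mirroring how field defaults work. An unknown writer symbol is
# mapped to the reader's declared default instead of raising an error.

def read_enum(symbol, reader_symbols, default=None):
    if symbol in reader_symbols:
        return symbol
    if default is not None:
        return default  # writer evolved; fall back gracefully
    raise ValueError(f"unknown enum symbol: {symbol!r}")

# Reader schema knows CAR and TRUCK and declares UNKNOWN as its default.
reader_symbols = {"CAR", "TRUCK", "UNKNOWN"}
print(read_enum("BUS", reader_symbols, default="UNKNOWN"))
```

Without the `default` attribute the function keeps the current behavior (an error on an unknown symbol), which is what makes the extension backwards compatible.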
> >>
> >> I think this use-case is common enough that it’d be worth exploring (for
> >> example, Rest.li [1] adds an `$UNKNOWN` symbol to somewhat address this
> >> issue; though not completely unfortunately [2]).
> >>
> >> -Matthieu
> >>
> >> [1] https://github.com/linkedin/rest.li
> >> [2] https://github.com/linkedin/rest.li/wiki/Snapshots-and-Resource-Compatibility-Checking#why-is-adding-to-an-enum-considered-backwards-incompatible
> >>
> >>
> >>
> >>> On Mar 29, 2016, at 2:40 AM, Yacine Benabderrahmane <ybenabderrahmane@octo.com> wrote:
> >>>
> >>> Hi Ryan,
> >>>
> >>> Just a little up^^. Could you please (or anyone else) give me a little
> >>> feedback ?
> >>> Thanks in advance.
> >>>
> >>> Regards,
> >>> Yacine.
> >>>
> >>> 2016-03-21 17:36 GMT+01:00 Yacine Benabderrahmane <ybenabderrahmane@octo.com>:
> >>>
> >>>> Hi Ryan,
> >>>>
> >>>> Thank you for giving feedback. In the following, I will try to
> >>>> provide you some more details about the problem addressed.
> >>>>
> >>>> But before that, just a brief reminder of the context. Avro has been
> >>>> chosen in this project (and surely by many others) especially for a
> >>>> very important feature: enabling forward and backward compatibility
> >>>> management through the schema life-cycle. Our development model
> >>>> involves intensive usage of this feature, and many heavy developments
> >>>> are made in parallel streams inside feature teams that share the same
> >>>> schema, provided the evolution of the latter complies with the stated
> >>>> compatibility rules. This implies that all the entities supported by
> >>>> the Avro schema must support the two-way compatibility, including
> >>>> unions. However, in the special case of the union, this two-way
> >>>> compatibility is not well supported by the current rules. Let me
> >>>> explain the basis of our point of view; it remains quite simple.
> >>>>
> >>>> The use case is to have, for example,
> >>>> - a first union version A:
> >>>>     { "name": "Vehicle",
> >>>>       "type": ["null", "Car"] }
> >>>> - a new version of it, B:
> >>>>     { "name": "Vehicle",
> >>>>       "type": ["null", "Car", "Bus"] }
> >>>> To be forward compatible, an evolution of the union schema must
> >>>> guarantee that an old reader reading with A can read the data written
> >>>> with the new schema B. Getting an error just means that the forward
> >>>> compatibility feature is broken. But this should not actually be the
> >>>> case (and this behavior is not suitable), because the old reader has
> >>>> a correct schema and this schema has evolved naturally to version B
> >>>> to incorporate a new Vehicle type. Not knowing this new type must not
> >>>> produce an error, but just give the reader a default value, which
> >>>> means: "Either the data is not there or you do not know how to handle
> >>>> it".
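The default-value behavior described above can be sketched as follows (an illustration of the proposed mode, not current Avro behavior; `read_union` is an assumed name):

```python
# Sketch (assumption): the proposed union compatibility mode. If the branch
# used by the writer is absent from the reader's union, return the union's
# default value instead of raising an error.

def read_union(written_branch, value, reader_branches, default=None):
    if written_branch in reader_branches:
        return value  # known branch -> the actual datum
    return default    # unknown branch: "the data is not there or you
                      # do not know how to handle it"

reader_a = {"null", "Car"}                         # old schema A
print(read_union("Car", {"vin": "X1"}, reader_a))  # known branch
print(read_union("Bus", {"vin": "X2"}, reader_a))  # unknown -> default
```

Current Avro resolution would raise an error at the second call; the proposed mode substitutes the union default instead.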
> >>>>
> >>>> This is thought while keeping in mind that in object-oriented code
> >>>> modeling, a union field is seen as a class member with the highest
> >>>> generic type ("Any" in Scala, "Object" in Java...). Therefore, it is
> >>>> natural for a modeler / programmer to implement the ability of not
> >>>> getting the awaited types and using some default value of known type.
> >>>> To give a more complete specification, the new compatibility mode has
> >>>> to impose one rule: the union default value must not change through
> >>>> versions, and the corresponding type must be placed at the top of the
> >>>> types list. This is much easier to handle by development streams,
> >>>> because it is addressed once and for all at the very beginning of the
> >>>> schema life-cycle, than obliging a number of teams, among which some
> >>>> are just not in place anymore, to update the whole code just because
> >>>> another dev team has deployed a new version of the union in the
> >>>> schema.
> >>>>
> >>>> Now, for being backward compatible, the reader with B must always be
> >>>> able to read data written with schema A: even if the type included in
> >>>> the data is not known, it gets the default value and not an error.
> >>>>
> >>>> I understand that getting an error could make sense when the
> >>>> requested field is not present. However, this behavior:
> >>>>
> >>>>  - is very restrictive: it obliges the old reader to update its code
> >>>>  to integrate the new schema, while it may not manage to do so for
> >>>>  many reasons: the development stream of the next delivery is not
> >>>>  finished, or not engaged, or not even planned - in the case of old
> >>>>  and stable code
> >>>>  - breaks the forward compatibility feature: the older reader is not
> >>>>  able to read the new version of the union without getting an error
> >>>>  - breaks the backward compatibility feature: the new reader is not
> >>>>  able to read an old version containing unknown types of the union
> >>>>  without getting an error
> >>>>
> >>>> By the way, what do you exactly mean by "pushing evolution lower" and
> >>>> "update the record"? Could you please give me an example of the trick
> >>>> you are talking about?
> >>>>
> >>>> Just to be a bit more precise, we are not targeting to use a "trick".
> >>>> This life-cycle management should be included in a standard so as to
> >>>> keep the software development clean, production safe, and compliant
> >>>> with a complex product road-map.
> >>>>
> >>>> Finally, you seem to be concerned by the "significant change to the
> >>>> current evolution rules". Well, we actually do not change these
> >>>> rules; they stay just the same. All we are proposing is to introduce
> >>>> a *mode* where the rules of union compatibility change. This mode is
> >>>> materialized by a minimal and thin impact on the existing classes,
> >>>> without any change in their behavior; all the logic of the new
> >>>> compatibility mode is implemented by new classes that must be invoked
> >>>> specifically. But you would better see it in the code patch.
> >>>>
> >>>> Looking forward to reading your feedback and answers.
> >>>>
> >>>> Regards,
> >>>> Yacine.
> >>>>
> >>>>
> >>>> 2016-03-17 19:00 GMT+01:00 Ryan Blue <rblue@netflix.com.invalid>:
> >>>>
> >>>>> Hi Yacine,
> >>>>>
> >>>>> Thanks for the proposal. It sounds interesting, but I want to make
> >>>>> sure there's a clear use case for this because it's a significant
> >>>>> change to the current evolution rules. Right now we guarantee that a
> >>>>> reader will get an error if the data has an unknown union branch
> >>>>> rather than getting a default value. I think that makes sense: if
> >>>>> the reader is requesting a field, it should get the actual datum for
> >>>>> it rather than a default, because it doesn't know how to handle it.
> >>>>>
> >>>>> Could you give us an example use case that requires this new logic?
> >>>>>
> >>>>> I just want to make sure we can't solve your problem another way.
> >>>>> For example, pushing evolution lower in the schema usually does the
> >>>>> trick: rather than having ["null", "RecordV1"] => ["null",
> >>>>> "RecordV1", "RecordV2"], it is usually better to update the record
> >>>>> so that older readers can ignore the new fields.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> rb
> >>>>>
> >>>>> On Mon, Mar 14, 2016 at 7:30 AM, Yacine Benabderrahmane <
> >>>>> ybenabderrahmane@octo.com> wrote:
> >>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> In order to provide a solution to the union schema evolution
> >>>>>> problem, as it was earlier notified in the thread "add a type to a
> >>>>>> union <http://search-hadoop.com/m/F2svI1IXrQS1bIFgU1/union+evolution&subj=add+a+type+to+a+union>"
> >>>>>> of the user mailing list, we decided, for the needs of the reactive
> >>>>>> architecture we have implemented for one of our clients, to
> >>>>>> implement an evolution of the compatibility principle of Avro when
> >>>>>> using unions. As a reminder, the question asked was about the way
> >>>>>> to handle the case where a reader, using an old version of a schema
> >>>>>> that includes a union, reads some data written with a new version
> >>>>>> of the schema where a type has been added to the union.
> >>>>>>
> >>>>>> As answered by Martin Kleppmann in that thread, one way to handle
> >>>>>> this kind of evolution (a new version of the schema adds a new type
> >>>>>> in a union) would be to ensure that all the development streams
> >>>>>> have integrated the new schema B before deploying it in the IT
> >>>>>> schema referential.
> >>>>>> However, in big structures involving strongly uncorrelated teams
> >>>>>> (from the product life-cycle point of view), this approach appears
> >>>>>> to be quite impracticable, causing production stream congestion,
> >>>>>> blocking behavior between teams, and a bunch of other unwanted,
> >>>>>> counter-agile / counter-reactive phenomena...
> >>>>>>
> >>>>>> Therefore, we had to implement a new *compatibility mode* for
> >>>>>> unions, while taking care to comply with the following rules:
> >>>>>>
> >>>>>>  1. Clear rules of compatibility are stated and integrated for this
> >>>>>>  compatibility mode
> >>>>>>  2. The standard Avro behavior must be kept intact
> >>>>>>  3. All the evolution implementation must be done without
> >>>>>>  introducing any regression (all existing tests of the Avro stack
> >>>>>>  must succeed)
> >>>>>>  4. The code impact on the Avro stack must be minimized
> >>>>>>
> >>>>>> Just to give you a very brief overview (as I don't know if this is
> >>>>>> actually the place for a full detailed description), the evolution
> >>>>>> addresses the typical problem where two development streams use the
> >>>>>> same schema but in different versions, in the case described
> >>>>>> shortly as follows:
> >>>>>>
> >>>>>>  - The first development stream, called "DevA", uses version A of
> >>>>>>  a schema which integrates a union referencing two types, say
> >>>>>>  "null" and "string". The default value is set to null.
> >>>>>>  - The second development team, called "DevB", uses version B,
> >>>>>>  which is an evolution of version A, as it adds a reference to a
> >>>>>>  new type in the former union, say "long" (which makes it "null",
> >>>>>>  "string" and "long")
> >>>>>>  - When schema B is deployed on the schema referential (in our
> >>>>>>  case, the IO Confluent Schema Registry) subsequently to version A:
> >>>>>>     - The stream "DevA" must be able to read with schema A, even
> >>>>>>     if the data has been written using schema B with the type
> >>>>>>     "long" in the union. In the latter case, the read value is the
> >>>>>>     union default value
> >>>>>>     - The stream "DevB" must be able to read/write with schema B,
> >>>>>>     even if it writes the data using the type "long" in the union
> >>>>>>
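The rule that the union default stays first and unchanged across versions could be checked mechanically, for example (a sketch; `union_evolution_ok` is an illustrative name, not part of any Avro or Schema Registry API):

```python
# Sketch (assumption): a compatibility check for the proposed mode. An
# evolved union is accepted if the original branches are preserved in
# order and the default branch (the first one) is unchanged, so readers
# on either version agree on the fallback value.

def union_evolution_ok(old_union, new_union):
    # New branches may only be appended; the default branch stays first.
    return (len(new_union) >= len(old_union)
            and new_union[:len(old_union)] == old_union)

SCHEMA_A = ["null", "string"]
SCHEMA_B = ["null", "string", "long"]

print(union_evolution_ok(SCHEMA_A, SCHEMA_B))  # appending "long" is allowed
print(union_evolution_ok(SCHEMA_B, SCHEMA_A))  # removing a branch is not
```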
> >>>>>> The evolution that we implemented for this mode includes some
> >>>>>> rules that are based on the principles stated in the Avro
> >>>>>> documentation. It is even more powerful than shown in the few lines
> >>>>>> above, as it enables readers to get the default value of the union
> >>>>>> if the schema used for reading does not contain the type used by
> >>>>>> the writer in the union. This achieves a new mode of forward /
> >>>>>> backward compatibility. This evolution is for now working
> >>>>>> perfectly, and should be in production in the coming weeks. We have
> >>>>>> also made an evolution of the IO Confluent Schema Registry stack to
> >>>>>> support it, again in a transparent manner (we also intend to
> >>>>>> contribute to this stack in a second / parallel step).
> >>>>>>
> >>>>>> With the objective of contributing this new compatibility mode for
> >>>>>> unions to the Avro stack, I have some questions about the
> >>>>>> procedure:
> >>>>>>
> >>>>>>  1. How can I make the contribution proposal? Should I directly
> >>>>>>  provide a patch in JIRA and dive into the details right there?
> >>>>>>  2. The base version of this evolution is 1.7.7; is it eligible
> >>>>>>  for contribution evaluation anyway?
> >>>>>>
> >>>>>> Thanks in advance, looking forward to hearing from you and giving
> >>>>>> you more details.
> >>>>>>
> >>>>>> Kind Regards,
> >>>>>> --
> >>>>>> *Yacine Benabderrahmane*
> >>>>>> Architect
> >>>>>> *OCTO Technology*
> >>>>>> <http://www.octo.com>
> >>>>>> -----------------------------------------------
> >>>>>> Tel : +33 6 10 88 25 98
> >>>>>> 50 avenue des Champs Elysées
> >>>>>> 75008 PARIS
> >>>>>> www.octo.com
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Ryan Blue
> >>>>> Software Engineer
> >>>>> Netflix
> >>>>>
> >>>>
> >>>>
> >>
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>


-- 
Ryan Blue
Software Engineer
Netflix
