avro-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: Avro union compatibility mode enhancement proposal
Date Sun, 17 Apr 2016 02:26:28 GMT
+1

On Sat, Apr 16, 2016 at 5:54 PM, Matthieu Monsch <monsch@alum.mit.edu>
wrote:

> > I think it may make sense for
> > enums to add a special value for it.
>
> That would work great. To give slightly more flexibility to users, we
> could even allow the reader’s schema to explicitly specify which symbol to
> use when it reads an unknown symbol. If not specified, resolution would
> fail (consistent with the current behavior).
>
> For example (assuming the corresponding key is called “onUnknown”):
>
> > {
> >   “type”: “enum”,
> >   “name”: “Platform”,
> >   “symbols”: [“UNSUPPORTED”, “ANDROID”, “IOS”],
> >   “onUnknown”: “UNSUPPORTED"
> > }
>
> This `enum` would then be able to resolve schemas with extra symbols (read
> as `UNSUPPORTED`).
>
>
>
> > On Apr 16, 2016, at 1:36 PM, Ryan Blue <rblue@netflix.com.INVALID>
> wrote:
> >
> > Matthieu, how would that work with enums? I think it may make sense for
> > enums to add a special value for it.
> >
> > rb
> >
> > On Sat, Apr 16, 2016 at 1:23 PM, Matthieu Monsch <monsch@alum.mit.edu>
> > wrote:
> >
> >>> I think that substituting different data for
> >>> unrecognized branches in a union isn't the way to fix the problem, but
> I
> >>> have a proposal for a way to fix it that looks more like you'd expect
> in
> >>> the OO example above: by adding the Vehicle class to your read schema.
> >>>
> >>> Right now, unions are first resolved by full class name as required by
> >> the
> >>> spec. But after that, we have some additional rules to match schemas.
> >> These
> >>> rules are how we reconcile name differences from situations like
> writing
> >>> with a generic class and reading with a specific class. I'm proposing
> you
> >>> use a catch-all class (the superclass) with fields that are in all of
> the
> >>> union's branches, and we update schema resolution to allow it.
> >>
> >>
> >> That sounds good. The only thing I’d add is to make this catch-all
> >> behavior explicit in the schema (similar to how field defaults must be
> >> explicitly added).
> >> To help fix another common writer evolution issue, we could also add a
> >> similar catch-all for `enum`s (optional, to be explicitly specified in
> the
> >> schema).
> >> -Matthieu
> >>
> >>
> >>
> >>> On Apr 12, 2016, at 2:41 PM, Ryan Blue <rblue@netflix.com.INVALID>
> >> wrote:
> >>>
> >>> Yacine,
> >>>
> >>> Thanks for the extra information. Sorry for my delay in replying, I
> >> wanted
> >>> to time think about your suggestion.
> >>>
> >>> I see what you mean that you can think of a union as the superclass of
> >> its
> >>> options. The reflect object model has test code that does just that,
> >> where
> >>> classes B and C inherit from A and the schema for A is created as a
> union
> >>> of B and C. But, I don't think that your suggestion aligns with the
> >>> expectations of object oriented design. Maybe that's an easier way to
> >>> present my concern:
> >>>
> >>> Say I have a class, Vehicle, with subclasses Car and Truck. I have
> >>> applications that work with my dataset, the vehicles that my company
> >> owns,
> >>> and we buy a bus. If I add a Bus class, what normally happens is that
> an
> >>> older application can work with it. A maintenance tracker would call
> >>> getVehicles and can still get the lastMaintenanceDate for my Bus, even
> >>> though it doesn't know about busses. But what you suggest is that it is
> >>> replaced with a default, say null, in cases like this.
> >>>
> >>> I think that the problem is that Avro no equivalent concept of
> >> inheritance.
> >>> There's only one way to represent it for what you need right now, like
> >>> Matthieu suggested. I think that substituting different data for
> >>> unrecognized branches in a union isn't the way to fix the problem, but
> I
> >>> have a proposal for a way to fix it that looks more like you'd expect
> in
> >>> the OO example above: by adding the Vehicle class to your read schema.
> >>>
> >>> Right now, unions are first resolved by full class name as required by
> >> the
> >>> spec. But after that, we have some additional rules to match schemas.
> >> These
> >>> rules are how we reconcile name differences from situations like
> writing
> >>> with a generic class and reading with a specific class. I'm proposing
> you
> >>> use a catch-all class (the superclass) with fields that are in all of
> the
> >>> union's branches, and we update schema resolution to allow it. So you'd
> >>> have...
> >>>
> >>> record Vehicle {
> >>> long lastMaintenanceDate;
> >>> }
> >>>
> >>> record Car {
> >>> long lastMaintenanceDate;
> >>> int maxChildSeats;
> >>> }
> >>>
> >>> record Truck {
> >>> long lastMainetanceDate;
> >>> long lastWashedDate;
> >>> }
> >>>
> >>> Write schema: [null, Car, Truck] Read schema: [null, Car, Vehicle]. The
> >>> catch-all class could provide the behavior you want, without silently
> >>> giving you incorrect data.
> >>>
> >>> Does that sound like a reasonable solution for your use case?
> >>>
> >>> rb
> >>>
> >>> On Tue, Mar 29, 2016 at 10:03 PM, Matthieu Monsch <monsch@alum.mit.edu
> >
> >>> wrote:
> >>>
> >>>> Hi Yacine,
> >>>>
> >>>> I believe Ryan was mentioning that if you start from a schema
> `[“null”,
> >>>> “Car”]` then rather than add a new bus branch to the union, you
could
> >>>> update the car’s schema to include new bus fields. For example (using
> >> IDL
> >>>> notation):
> >>>>
> >>>>> // Initial approach
> >>>>> // Evolved to union { null, Car, Bus }
> >>>>> record Car {
> >>>>> string vin;
> >>>>> }
> >>>>> record Bus {
> >>>>> string vin;
> >>>>> int capacity;
> >>>>> }
> >>>>>
> >>>>> // Alternative approach
> >>>>> // Here evolution wouldn't change the union { null, Vehicle }
> >>>>> // Note also that this is actually a compatible evolution from the
> >> first
> >>>> approach, even under current rules (using aliases for name changes).
> >>>>> record Vehicle {
> >>>>> string vin; // (Would be optional if not all branches had it.)
> >>>>> union { null, int } capacity; // The new field is added here.
> >>>>> }
> >>>>
> >>>> Pushing this further, you could also directly start upfront with just
> a
> >>>> “Vehicle” record schema with an optional `car` field and add new
> >> optional
> >>>> fields as necessary (for example a `bus` field). This gives you much
> >>>> greater flexibility for evolving readers and writers separately (but
> you
> >>>> lose the ability to enforce that at most one field is ever present in
> >> the
> >>>> record).
> >>>>
> >>>> I agree that Avro’s current evolution rules are more geared towards
> >> reader
> >>>> evolution: evolving writers while keeping readers constant is
> difficult
> >>>> (namely because they don't support adding branches to unions or
> symbols
> >> to
> >>>> enums). However, adding a completely new and separate "compatibility
> >> mode”
> >>>> has a high complexity cost; both for implementors (the separate
> classes
> >> you
> >>>> mention) and users (who must invoke them specifically). It would be
> >> best to
> >>>> keep evolution rules universal.
> >>>>
> >>>> Maybe we could extend the logic behind field defaults? Enum schemas
> >> could
> >>>> simply contain a default symbol attribute. For unions, it’s a bit
> >> trickier,
> >>>> but we should be able to get around with a default branch attribute
> >> which
> >>>> would allow evolution if and only if the newly added branches can be
> >> read
> >>>> as the default branch. Assuming the attribute names don’t already
> exist,
> >>>> this would be compatible with the current behavior.
> >>>>
> >>>> I think this use-case is common enough that it’d be worth exploring
> (for
> >>>> example, Rest.li [1] adds an `$UNKNOWN` symbol to somewhat address
> this
> >>>> issue; though not completely unfortunately [2]).
> >>>>
> >>>> -Matthieu
> >>>>
> >>>> [1] https://github.com/linkedin/rest.li
> >>>> [2]
> >>>>
> >>
> https://github.com/linkedin/rest.li/wiki/Snapshots-and-Resource-Compatibility-Checking#why-is-adding-to-an-enum-considered-backwards-incompatible
> >>>>
> >>>>
> >>>>
> >>>>> On Mar 29, 2016, at 2:40 AM, Yacine Benabderrahmane <
> >>>> ybenabderrahmane@octo.com> wrote:
> >>>>>
> >>>>> Hi Ryan,
> >>>>>
> >>>>> Just a little up^^. Could you please (or anyone else) give me a
> little
> >>>>> feedback ?
> >>>>> Thanks in advance.
> >>>>>
> >>>>> Regards,
> >>>>> Yacine.
> >>>>>
> >>>>> 2016-03-21 17:36 GMT+01:00 Yacine Benabderrahmane <
> >>>> ybenabderrahmane@octo.com
> >>>>>> :
> >>>>>
> >>>>>> Hi Ryan,
> >>>>>>
> >>>>>> Thank you for giving feedback. I will try in the following to
> provide
> >>>> you
> >>>>>> some more details about the addressed problem.
> >>>>>>
> >>>>>> But before that, just a brief reminder of the context. Avro
has been
> >>>>>> chosen in this project (and by many other ones for sure) especially
> >> for
> >>>> a
> >>>>>> very important feature: enabling forward and backward compatibility
> >>>>>> management through schema life-cycle. Our development model
involves
> >>>>>> intensive usage of this feature, and many heavy developments
are
> made
> >> in
> >>>>>> parallel streams inside feature teams that share the same schema,
> >>>> provided
> >>>>>> the evolution of the latter complies with the stated compatibility
> >>>> rules.
> >>>>>> This implies that all the entities supported by the Avro schema
must
> >>>>>> support the two-way compatibility, including unions. However,
in the
> >>>>>> special case of the union, this two-way compatibility is not
well
> >>>> supported
> >>>>>> by the current rules. Let me explain you the basement of our
point
> of
> >>>> view,
> >>>>>> it remains quite simple.
> >>>>>>
> >>>>>> The use case is to have, for example,
> >>>>>> - a first union version A:
> >>>>>>    { "name": "Vehicle",
> >>>>>>      "type": ["null", "Car"] }
> >>>>>> - a new version of it B:
> >>>>>>    { "name": "Vehicle",
> >>>>>>      "type": ["null", "Car", "Bus"] }
> >>>>>> For being forward compatible, an evolution of the union schema
must
> >>>>>> guarantee that an old reader reading with A can read the data
> written
> >>>> with
> >>>>>> the new schema B. Getting an error just means that the forward
> >>>>>> compatibility feature is broken. But this is not actually the
case
> >> (and
> >>>>>> this behavior is not suitable), because the old reader has a
correct
> >>>> schema
> >>>>>> and this schema has evolved naturally to version B to incorporate
a
> >> new
> >>>>>> Vehicle type. Not knowing this new type must not produce an
error,
> but
> >>>> just
> >>>>>> give the reader a default value, which means: "Either the data
is
> not
> >>>> there
> >>>>>> or you do not know how to handle it".
> >>>>>>
> >>>>>> This is thought while keeping in mind that in an object-oriented
> code
> >>>>>> modeling, a union field is seen as a class member with the higher
> >> level
> >>>>>> generic type ("Any" (scala) or "Object" (java5+)...). Therefore,
it
> is
> >>>>>> natural for a modeler / programmer to implement the ability
of not
> >>>> getting
> >>>>>> the awaited types and using some default value of known type.
To
> give
> >> a
> >>>>>> more complete specification, the new mode of compatibility has
to
> >> impose
> >>>>>> one rule: the union default value must not change through versions
> and
> >>>> the
> >>>>>> corresponding type must be placed at the top of the types list.
This
> >> is
> >>>>>> much easier to handle by development streams, because it is
> addressed
> >>>> once
> >>>>>> for all in the very beginning of the schema life-cycle, than
the
> fact
> >> to
> >>>>>> oblige a number of teams, among which some are just not in place
> >>>> anymore,
> >>>>>> to update the whole code just because another dev team has deployed
> a
> >>>> new
> >>>>>> version of the union in the schema.
> >>>>>>
> >>>>>> Now, for being backward compatible, the reader with B must always
be
> >>>> able
> >>>>>> to read data written with schema A. Even if the type included
in the
> >>>> data
> >>>>>> is not known, so it gets the default value and not an error.
> >>>>>>
> >>>>>> I understand that getting an error could make sense when the
> requested
> >>>>>> field is not present. However, this behavior:
> >>>>>>
> >>>>>> - is very restrictive, meaning: this obliges the old reader
to
> update
> >>>>>> its code for integrating the new schema, while he is not managing
to
> >>>> do it
> >>>>>> for many reasons: development stream of next delivery is not
> >>>> finished, or
> >>>>>> not engaged, or not even planned - in the case of old and stable
> code
> >>>>>> - breaks the forward compatibility feature: the older reader
is not
> >>>>>> able to read the new version of the union without getting an
error
> >>>>>> - breaks the backward compatibility feature: the new reader
is not
> >>>>>> able to read an old version containing unknown types of the
union
> >>>> without
> >>>>>> getting an error
> >>>>>>
> >>>>>> By the way, what do you exactly mean by "pushing evolution lower"
> and
> >>>>>> "update the record"? Could you please give me an example of
the
> trick
> >>>> you
> >>>>>> are talking about?
> >>>>>>
> >>>>>> Just to be a bit more precise, we are not targeting to use a
> "trick".
> >>>> This
> >>>>>> life-cycle management should be included in a standard so to
keep
> the
> >>>>>> software development clean, production safe and compliant with
a
> >> complex
> >>>>>> product road-map.
> >>>>>>
> >>>>>> Finally, you seem to be concerned by the "significant change
to the
> >>>>>> current evolution rules". Well, we actually do not change these
> rules,
> >>>> they
> >>>>>> keep just the same. All we are proposing is to introduce a *mode*
> >> where
> >>>>>> the rules of union compatibility change. This mode is materialized
> by
> >> a
> >>>>>> minimum and thin impact of the existing classes without any
change
> in
> >>>> the
> >>>>>> behavior, all the logic of the new compatibility mode is implemented
> >> by
> >>>> new
> >>>>>> classes that must be invoked specifically. But you would better
see
> it
> >>>> in
> >>>>>> the code patch.
> >>>>>>
> >>>>>> Looking forward to reading your feedback and answers.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Yacine.
> >>>>>>
> >>>>>>
> >>>>>> 2016-03-17 19:00 GMT+01:00 Ryan Blue <rblue@netflix.com.invalid>:
> >>>>>>
> >>>>>>> Hi Yacine,
> >>>>>>>
> >>>>>>> Thanks for the proposal. It sounds interesting, but I want
to make
> >> sure
> >>>>>>> there's a clear use case for this because it's a significant
change
> >> to
> >>>> the
> >>>>>>> current evolution rules. Right now we guarantee that a reader
will
> >> get
> >>>> an
> >>>>>>> error if the data has an unknown union branch rather than
getting a
> >>>>>>> default
> >>>>>>> value. I think that makes sense: if the reader is requesting
a
> field,
> >>>> it
> >>>>>>> should get the actual datum for it rather than a default
because it
> >>>>>>> doesn't
> >>>>>>> know how to handle it.
> >>>>>>>
> >>>>>>> Could you give us an example use case that requires this
new logic?
> >>>>>>>
> >>>>>>> I just want to make sure we can't solve your problem another
way.
> For
> >>>>>>> example, pushing evolution lower in the schema usually does
the
> >> trick:
> >>>>>>> rather than having ["null", "RecordV1"] => ["null", "RecordV1",
> >>>>>>> "RecordV2"], it is usually better to update the record so
that
> older
> >>>>>>> readers can ignore the new fields.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> rb
> >>>>>>>
> >>>>>>> On Mon, Mar 14, 2016 at 7:30 AM, Yacine Benabderrahmane
<
> >>>>>>> ybenabderrahmane@octo.com> wrote:
> >>>>>>>
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> In order to provide a solution to the union schema evolution
> >> problem,
> >>>>>>> as it
> >>>>>>>> was earlier notified in the thread "add a type to a
union
> >>>>>>>> <
> >>>>>>>>
> >>>>>>>
> >>>>
> >>
> http://search-hadoop.com/m/F2svI1IXrQS1bIFgU1/union+evolution&subj=add+a+type+to+a+union
> >>>>>>>>> "
> >>>>>>>> of the user mailing list, we decided, for the needs
of the
> reactive
> >>>>>>>> architecture we have implemented for one of our clients,
to
> >> implement
> >>>> an
> >>>>>>>> evolution of the compatibility principle of Avro when
using
> Unions.
> >>>> For
> >>>>>>>> reminder, the asked question was about the way to handle
the case
> >>>> where
> >>>>>>> a
> >>>>>>>> reader, using an old version of a schema that includes
a union,
> >> reads
> >>>>>>> some
> >>>>>>>> data written with a new version of the schema where
a type has
> been
> >>>>>>> added
> >>>>>>>> to the union.
> >>>>>>>>
> >>>>>>>> As answered by Martin Kleppman in that thread, one way
to handle
> >> this
> >>>>>>> kind
> >>>>>>>> of evolution (a new version of the schema adds a new
type type in
> a
> >>>>>>> union)
> >>>>>>>> would be to ensure that all the development streams
have
> integrated
> >>>> the
> >>>>>>> new
> >>>>>>>> schema B before deploying it in the IT schema referential.
> >>>>>>>> However, in big structures involving strongly uncorrelated
teams
> (in
> >>>> the
> >>>>>>>> product life-cycle point of view), this approach appears
to be
> quite
> >>>>>>>> impracticable, causing production stream congestion,
blocking
> >> behavior
> >>>>>>>> between teams, and a bunch of other
> >>>>>>>> unwanted-counter-agile-/-reactive-phenomena...
> >>>>>>>>
> >>>>>>>> Therefore, we had to implement a new *compatibility*
*mode* for
> the
> >>>>>>> unions,
> >>>>>>>> while taking care to comply with the following rules:
> >>>>>>>>
> >>>>>>>> 1. Clear rules of compatibility are stated and integrated
for this
> >>>>>>>> compatibility mode
> >>>>>>>> 2. The standard Avro behavior must be kept intact
> >>>>>>>> 3. All the evolution implementation must be done without
> >> introducing
> >>>>>>> any
> >>>>>>>> regression (all existing tests of the Avro stack must
succeed)
> >>>>>>>> 4. The code impact on Avro stack must be minimized
> >>>>>>>>
> >>>>>>>> Just to give you a very brief overview (as I don't know
if this is
> >>>>>>> actually
> >>>>>>>> the place for a full detailed description), the evolution
> addresses
> >>>> the
> >>>>>>>> typical problem where two development streams use the
same schema
> >> but
> >>>> in
> >>>>>>>> different versions, in the case described shortly as
follows:
> >>>>>>>>
> >>>>>>>> - The first development stream, called "DevA", uses
the version A
> >> of
> >>>>>>> a
> >>>>>>>> schema which integrates a union referencing two types,
say "null"
> >>>> and
> >>>>>>>> "string". The default value is set to null.
> >>>>>>>> - The second development team, called "DevB", uses the
version B,
> >>>>>>> which
> >>>>>>>> is an evolution of the version A, as it adds a reference
to a new
> >>>>>>> type
> >>>>>>>> in
> >>>>>>>> the former union, say "long" (which makes it "null",
string" and
> >>>>>>> "long")
> >>>>>>>> - When the schema B is deployed on the schema referential
(in our
> >>>>>>> case,
> >>>>>>>> the IO Confluent Schema Registry) subsequently to the
version A
> >>>>>>>>    - The stream "DevA" must be able to read with schema
A, even if
> >>>>>>> the
> >>>>>>>>    data has been written using the schema B with the
type "long"
> in
> >>>>>>>> the union.
> >>>>>>>>    In the latter case, the read value is the union default
value
> >>>>>>>>    - The stream "DevB" must be able to read/write with
schema B,
> >>>>>>> even if
> >>>>>>>>    it writes the data using the type "long" in the union
> >>>>>>>>
> >>>>>>>> The evolution that we implemented for this mode includes
some
> rules
> >>>> that
> >>>>>>>> are based on the principles stated in the Avro documentation.
It
> is
> >>>> even
> >>>>>>>> more powerful than showed in the few lines above, as
it enables
> the
> >>>>>>> readers
> >>>>>>>> to get the default value of the union if the schema
used for
> reading
> >>>>>>> does
> >>>>>>>> not contain the type used by the writer in the union.
This
> achieves
> >> a
> >>>>>>> new
> >>>>>>>> mode of forward / backward compatibility. This evolution
is for
> now
> >>>>>>> working
> >>>>>>>> perfectly, and should be on production in the few coming
weeks. We
> >>>> have
> >>>>>>>> also made an evolution of the IO Confluent Schema Registry
stack
> to
> >>>>>>> support
> >>>>>>>> it, again in a transparent manner (we also intend to
contribute to
> >>>> this
> >>>>>>>> stack in a second / parallel step).
> >>>>>>>>
> >>>>>>>> In the objective of contributing to the Avro stack with
this new
> >>>>>>>> compatibility mode for unions, I have some questions
about the
> >>>>>>> procedure:
> >>>>>>>>
> >>>>>>>> 1. How can I achieve the contribution proposal? Should
I directly
> >>>>>>>> provide a patch in JIRA and dive into the details right
there?
> >>>>>>>> 2. The base version of this evolution is 1.7.7, is it
eligible to
> >>>>>>>> contribution evaluation anyway?
> >>>>>>>>
> >>>>>>>> Thanks in advance, looking forward to hearing from you
and giving
> >> you
> >>>>>>> more
> >>>>>>>> details.
> >>>>>>>>
> >>>>>>>> Kind Regards,
> >>>>>>>> --
> >>>>>>>> *Yacine Benabderrahmane*
> >>>>>>>> Architect
> >>>>>>>> *OCTO Technology*
> >>>>>>>> <http://www.octo.com>
> >>>>>>>> -----------------------------------------------
> >>>>>>>> Tel : +33 6 10 88 25 98
> >>>>>>>> 50 avenue des Champs Elysées
> >>>>>>>> 75008 PARIS
> >>>>>>>> www.octo.com
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Ryan Blue
> >>>>>>> Software Engineer
> >>>>>>> Netflix
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Ryan Blue
> >>> Software Engineer
> >>> Netflix
> >>
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message