avro-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matthieu Monsch <mon...@alum.mit.edu>
Subject Re: Avro union compatibility mode enhancement proposal
Date Sun, 17 Apr 2016 00:54:56 GMT
> I think it may make sense for
> enums to add a special value for it.

That would work great. To give slightly more flexibility to users, we could even allow the
reader’s schema to explicitly specify which symbol to use when it reads an unknown symbol.
If not specified, resolution would fail (consistent with the current behavior).

For example (assuming the corresponding key is called “onUnknown”):

> {
>   “type”: “enum”,
>   “name”: “Platform”,
>   “symbols”: [“UNSUPPORTED”, “ANDROID”, “IOS”],
>   “onUnknown”: “UNSUPPORTED"
> }

This `enum` would then be able to resolve schemas with extra symbols (read as `UNSUPPORTED`).

> On Apr 16, 2016, at 1:36 PM, Ryan Blue <rblue@netflix.com.INVALID> wrote:
> Matthieu, how would that work with enums? I think it may make sense for
> enums to add a special value for it.
> rb
> On Sat, Apr 16, 2016 at 1:23 PM, Matthieu Monsch <monsch@alum.mit.edu>
> wrote:
>>> I think that substituting different data for
>>> unrecognized branches in a union isn't the way to fix the problem, but I
>>> have a proposal for a way to fix it that looks more like you'd expect in
>>> the OO example above: by adding the Vehicle class to your read schema.
>>> Right now, unions are first resolved by full class name as required by
>> the
>>> spec. But after that, we have some additional rules to match schemas.
>> These
>>> rules are how we reconcile name differences from situations like writing
>>> with a generic class and reading with a specific class. I'm proposing you
>>> use a catch-all class (the superclass) with fields that are in all of the
>>> union's branches, and we update schema resolution to allow it.
>> That sounds good. The only thing I’d add is to make this catch-all
>> behavior explicit in the schema (similar to how field defaults must be
>> explicitly added).
>> To help fix another common writer evolution issue, we could also add a
>> similar catch-all for `enum`s (optional, to be explicitly specified in the
>> schema).
>> -Matthieu
>>> On Apr 12, 2016, at 2:41 PM, Ryan Blue <rblue@netflix.com.INVALID>
>> wrote:
>>> Yacine,
>>> Thanks for the extra information. Sorry for my delay in replying, I
>> wanted
>>> to time think about your suggestion.
>>> I see what you mean that you can think of a union as the superclass of
>> its
>>> options. The reflect object model has test code that does just that,
>> where
>>> classes B and C inherit from A and the schema for A is created as a union
>>> of B and C. But, I don't think that your suggestion aligns with the
>>> expectations of object oriented design. Maybe that's an easier way to
>>> present my concern:
>>> Say I have a class, Vehicle, with subclasses Car and Truck. I have
>>> applications that work with my dataset, the vehicles that my company
>> owns,
>>> and we buy a bus. If I add a Bus class, what normally happens is that an
>>> older application can work with it. A maintenance tracker would call
>>> getVehicles and can still get the lastMaintenanceDate for my Bus, even
>>> though it doesn't know about busses. But what you suggest is that it is
>>> replaced with a default, say null, in cases like this.
>>> I think that the problem is that Avro no equivalent concept of
>> inheritance.
>>> There's only one way to represent it for what you need right now, like
>>> Matthieu suggested. I think that substituting different data for
>>> unrecognized branches in a union isn't the way to fix the problem, but I
>>> have a proposal for a way to fix it that looks more like you'd expect in
>>> the OO example above: by adding the Vehicle class to your read schema.
>>> Right now, unions are first resolved by full class name as required by
>> the
>>> spec. But after that, we have some additional rules to match schemas.
>> These
>>> rules are how we reconcile name differences from situations like writing
>>> with a generic class and reading with a specific class. I'm proposing you
>>> use a catch-all class (the superclass) with fields that are in all of the
>>> union's branches, and we update schema resolution to allow it. So you'd
>>> have...
>>> record Vehicle {
>>> long lastMaintenanceDate;
>>> }
>>> record Car {
>>> long lastMaintenanceDate;
>>> int maxChildSeats;
>>> }
>>> record Truck {
>>> long lastMainetanceDate;
>>> long lastWashedDate;
>>> }
>>> Write schema: [null, Car, Truck] Read schema: [null, Car, Vehicle]. The
>>> catch-all class could provide the behavior you want, without silently
>>> giving you incorrect data.
>>> Does that sound like a reasonable solution for your use case?
>>> rb
>>> On Tue, Mar 29, 2016 at 10:03 PM, Matthieu Monsch <monsch@alum.mit.edu>
>>> wrote:
>>>> Hi Yacine,
>>>> I believe Ryan was mentioning that if you start from a schema `[“null”,
>>>> “Car”]` then rather than add a new bus branch to the union, you could
>>>> update the car’s schema to include new bus fields. For example (using
>> IDL
>>>> notation):
>>>>> // Initial approach
>>>>> // Evolved to union { null, Car, Bus }
>>>>> record Car {
>>>>> string vin;
>>>>> }
>>>>> record Bus {
>>>>> string vin;
>>>>> int capacity;
>>>>> }
>>>>> // Alternative approach
>>>>> // Here evolution wouldn't change the union { null, Vehicle }
>>>>> // Note also that this is actually a compatible evolution from the
>> first
>>>> approach, even under current rules (using aliases for name changes).
>>>>> record Vehicle {
>>>>> string vin; // (Would be optional if not all branches had it.)
>>>>> union { null, int } capacity; // The new field is added here.
>>>>> }
>>>> Pushing this further, you could also directly start upfront with just a
>>>> “Vehicle” record schema with an optional `car` field and add new
>> optional
>>>> fields as necessary (for example a `bus` field). This gives you much
>>>> greater flexibility for evolving readers and writers separately (but you
>>>> lose the ability to enforce that at most one field is ever present in
>> the
>>>> record).
>>>> I agree that Avro’s current evolution rules are more geared towards
>> reader
>>>> evolution: evolving writers while keeping readers constant is difficult
>>>> (namely because they don't support adding branches to unions or symbols
>> to
>>>> enums). However, adding a completely new and separate "compatibility
>> mode”
>>>> has a high complexity cost; both for implementors (the separate classes
>> you
>>>> mention) and users (who must invoke them specifically). It would be
>> best to
>>>> keep evolution rules universal.
>>>> Maybe we could extend the logic behind field defaults? Enum schemas
>> could
>>>> simply contain a default symbol attribute. For unions, it’s a bit
>> trickier,
>>>> but we should be able to get around with a default branch attribute
>> which
>>>> would allow evolution if and only if the newly added branches can be
>> read
>>>> as the default branch. Assuming the attribute names don’t already exist,
>>>> this would be compatible with the current behavior.
>>>> I think this use-case is common enough that it’d be worth exploring (for
>>>> example, Rest.li [1] adds an `$UNKNOWN` symbol to somewhat address this
>>>> issue; though not completely unfortunately [2]).
>>>> -Matthieu
>>>> [1] https://github.com/linkedin/rest.li
>>>> [2]
>> https://github.com/linkedin/rest.li/wiki/Snapshots-and-Resource-Compatibility-Checking#why-is-adding-to-an-enum-considered-backwards-incompatible
>>>>> On Mar 29, 2016, at 2:40 AM, Yacine Benabderrahmane <
>>>> ybenabderrahmane@octo.com> wrote:
>>>>> Hi Ryan,
>>>>> Just a little up^^. Could you please (or anyone else) give me a little
>>>>> feedback ?
>>>>> Thanks in advance.
>>>>> Regards,
>>>>> Yacine.
>>>>> 2016-03-21 17:36 GMT+01:00 Yacine Benabderrahmane <
>>>> ybenabderrahmane@octo.com
>>>>>> :
>>>>>> Hi Ryan,
>>>>>> Thank you for giving feedback. I will try in the following to provide
>>>> you
>>>>>> some more details about the addressed problem.
>>>>>> But before that, just a brief reminder of the context. Avro has been
>>>>>> chosen in this project (and by many other ones for sure) especially
>> for
>>>> a
>>>>>> very important feature: enabling forward and backward compatibility
>>>>>> management through schema life-cycle. Our development model involves
>>>>>> intensive usage of this feature, and many heavy developments are
>> in
>>>>>> parallel streams inside feature teams that share the same schema,
>>>> provided
>>>>>> the evolution of the latter complies with the stated compatibility
>>>> rules.
>>>>>> This implies that all the entities supported by the Avro schema must
>>>>>> support the two-way compatibility, including unions. However, in
>>>>>> special case of the union, this two-way compatibility is not well
>>>> supported
>>>>>> by the current rules. Let me explain you the basement of our point
>>>> view,
>>>>>> it remains quite simple.
>>>>>> The use case is to have, for example,
>>>>>> - a first union version A:
>>>>>>    { "name": "Vehicle",
>>>>>>      "type": ["null", "Car"] }
>>>>>> - a new version of it B:
>>>>>>    { "name": "Vehicle",
>>>>>>      "type": ["null", "Car", "Bus"] }
>>>>>> For being forward compatible, an evolution of the union schema must
>>>>>> guarantee that an old reader reading with A can read the data written
>>>> with
>>>>>> the new schema B. Getting an error just means that the forward
>>>>>> compatibility feature is broken. But this is not actually the case
>> (and
>>>>>> this behavior is not suitable), because the old reader has a correct
>>>> schema
>>>>>> and this schema has evolved naturally to version B to incorporate
>> new
>>>>>> Vehicle type. Not knowing this new type must not produce an error,
>>>> just
>>>>>> give the reader a default value, which means: "Either the data is
>>>> there
>>>>>> or you do not know how to handle it".
>>>>>> This is thought while keeping in mind that in an object-oriented
>>>>>> modeling, a union field is seen as a class member with the higher
>> level
>>>>>> generic type ("Any" (scala) or "Object" (java5+)...). Therefore,
it is
>>>>>> natural for a modeler / programmer to implement the ability of not
>>>> getting
>>>>>> the awaited types and using some default value of known type. To
>> a
>>>>>> more complete specification, the new mode of compatibility has to
>> impose
>>>>>> one rule: the union default value must not change through versions
>>>> the
>>>>>> corresponding type must be placed at the top of the types list. This
>> is
>>>>>> much easier to handle by development streams, because it is addressed
>>>> once
>>>>>> for all in the very beginning of the schema life-cycle, than the
>> to
>>>>>> oblige a number of teams, among which some are just not in place
>>>> anymore,
>>>>>> to update the whole code just because another dev team has deployed
>>>> new
>>>>>> version of the union in the schema.
>>>>>> Now, for being backward compatible, the reader with B must always
>>>> able
>>>>>> to read data written with schema A. Even if the type included in
>>>> data
>>>>>> is not known, so it gets the default value and not an error.
>>>>>> I understand that getting an error could make sense when the requested
>>>>>> field is not present. However, this behavior:
>>>>>> - is very restrictive, meaning: this obliges the old reader to update
>>>>>> its code for integrating the new schema, while he is not managing
>>>> do it
>>>>>> for many reasons: development stream of next delivery is not
>>>> finished, or
>>>>>> not engaged, or not even planned - in the case of old and stable
>>>>>> - breaks the forward compatibility feature: the older reader is not
>>>>>> able to read the new version of the union without getting an error
>>>>>> - breaks the backward compatibility feature: the new reader is not
>>>>>> able to read an old version containing unknown types of the union
>>>> without
>>>>>> getting an error
>>>>>> By the way, what do you exactly mean by "pushing evolution lower"
>>>>>> "update the record"? Could you please give me an example of the trick
>>>> you
>>>>>> are talking about?
>>>>>> Just to be a bit more precise, we are not targeting to use a "trick".
>>>> This
>>>>>> life-cycle management should be included in a standard so to keep
>>>>>> software development clean, production safe and compliant with a
>> complex
>>>>>> product road-map.
>>>>>> Finally, you seem to be concerned by the "significant change to the
>>>>>> current evolution rules". Well, we actually do not change these rules,
>>>> they
>>>>>> keep just the same. All we are proposing is to introduce a *mode*
>> where
>>>>>> the rules of union compatibility change. This mode is materialized
>> a
>>>>>> minimum and thin impact of the existing classes without any change
>>>> the
>>>>>> behavior, all the logic of the new compatibility mode is implemented
>> by
>>>> new
>>>>>> classes that must be invoked specifically. But you would better see
>>>> in
>>>>>> the code patch.
>>>>>> Looking forward to reading your feedback and answers.
>>>>>> Regards,
>>>>>> Yacine.
>>>>>> 2016-03-17 19:00 GMT+01:00 Ryan Blue <rblue@netflix.com.invalid>:
>>>>>>> Hi Yacine,
>>>>>>> Thanks for the proposal. It sounds interesting, but I want to
>> sure
>>>>>>> there's a clear use case for this because it's a significant
>> to
>>>> the
>>>>>>> current evolution rules. Right now we guarantee that a reader
>> get
>>>> an
>>>>>>> error if the data has an unknown union branch rather than getting
>>>>>>> default
>>>>>>> value. I think that makes sense: if the reader is requesting
a field,
>>>> it
>>>>>>> should get the actual datum for it rather than a default because
>>>>>>> doesn't
>>>>>>> know how to handle it.
>>>>>>> Could you give us an example use case that requires this new
>>>>>>> I just want to make sure we can't solve your problem another
way. For
>>>>>>> example, pushing evolution lower in the schema usually does the
>> trick:
>>>>>>> rather than having ["null", "RecordV1"] => ["null", "RecordV1",
>>>>>>> "RecordV2"], it is usually better to update the record so that
>>>>>>> readers can ignore the new fields.
>>>>>>> Thanks,
>>>>>>> rb
>>>>>>> On Mon, Mar 14, 2016 at 7:30 AM, Yacine Benabderrahmane <
>>>>>>> ybenabderrahmane@octo.com> wrote:
>>>>>>>> Hi all,
>>>>>>>> In order to provide a solution to the union schema evolution
>> problem,
>>>>>>> as it
>>>>>>>> was earlier notified in the thread "add a type to a union
>>>>>>>> <
>> http://search-hadoop.com/m/F2svI1IXrQS1bIFgU1/union+evolution&subj=add+a+type+to+a+union
>>>>>>>>> "
>>>>>>>> of the user mailing list, we decided, for the needs of the
>>>>>>>> architecture we have implemented for one of our clients,
>> implement
>>>> an
>>>>>>>> evolution of the compatibility principle of Avro when using
>>>> For
>>>>>>>> reminder, the asked question was about the way to handle
the case
>>>> where
>>>>>>> a
>>>>>>>> reader, using an old version of a schema that includes a
>> reads
>>>>>>> some
>>>>>>>> data written with a new version of the schema where a type
has been
>>>>>>> added
>>>>>>>> to the union.
>>>>>>>> As answered by Martin Kleppman in that thread, one way to
>> this
>>>>>>> kind
>>>>>>>> of evolution (a new version of the schema adds a new type
type in a
>>>>>>> union)
>>>>>>>> would be to ensure that all the development streams have
>>>> the
>>>>>>> new
>>>>>>>> schema B before deploying it in the IT schema referential.
>>>>>>>> However, in big structures involving strongly uncorrelated
teams (in
>>>> the
>>>>>>>> product life-cycle point of view), this approach appears
to be quite
>>>>>>>> impracticable, causing production stream congestion, blocking
>> behavior
>>>>>>>> between teams, and a bunch of other
>>>>>>>> unwanted-counter-agile-/-reactive-phenomena...
>>>>>>>> Therefore, we had to implement a new *compatibility* *mode*
for the
>>>>>>> unions,
>>>>>>>> while taking care to comply with the following rules:
>>>>>>>> 1. Clear rules of compatibility are stated and integrated
for this
>>>>>>>> compatibility mode
>>>>>>>> 2. The standard Avro behavior must be kept intact
>>>>>>>> 3. All the evolution implementation must be done without
>> introducing
>>>>>>> any
>>>>>>>> regression (all existing tests of the Avro stack must succeed)
>>>>>>>> 4. The code impact on Avro stack must be minimized
>>>>>>>> Just to give you a very brief overview (as I don't know if
this is
>>>>>>> actually
>>>>>>>> the place for a full detailed description), the evolution
>>>> the
>>>>>>>> typical problem where two development streams use the same
>> but
>>>> in
>>>>>>>> different versions, in the case described shortly as follows:
>>>>>>>> - The first development stream, called "DevA", uses the version
>> of
>>>>>>> a
>>>>>>>> schema which integrates a union referencing two types, say
>>>> and
>>>>>>>> "string". The default value is set to null.
>>>>>>>> - The second development team, called "DevB", uses the version
>>>>>>> which
>>>>>>>> is an evolution of the version A, as it adds a reference
to a new
>>>>>>> type
>>>>>>>> in
>>>>>>>> the former union, say "long" (which makes it "null", string"
>>>>>>> "long")
>>>>>>>> - When the schema B is deployed on the schema referential
(in our
>>>>>>> case,
>>>>>>>> the IO Confluent Schema Registry) subsequently to the version
>>>>>>>>    - The stream "DevA" must be able to read with schema A,
even if
>>>>>>> the
>>>>>>>>    data has been written using the schema B with the type
"long" in
>>>>>>>> the union.
>>>>>>>>    In the latter case, the read value is the union default
>>>>>>>>    - The stream "DevB" must be able to read/write with schema
>>>>>>> even if
>>>>>>>>    it writes the data using the type "long" in the union
>>>>>>>> The evolution that we implemented for this mode includes
some rules
>>>> that
>>>>>>>> are based on the principles stated in the Avro documentation.
It is
>>>> even
>>>>>>>> more powerful than showed in the few lines above, as it enables
>>>>>>> readers
>>>>>>>> to get the default value of the union if the schema used
for reading
>>>>>>> does
>>>>>>>> not contain the type used by the writer in the union. This
>> a
>>>>>>> new
>>>>>>>> mode of forward / backward compatibility. This evolution
is for now
>>>>>>> working
>>>>>>>> perfectly, and should be on production in the few coming
weeks. We
>>>> have
>>>>>>>> also made an evolution of the IO Confluent Schema Registry
stack to
>>>>>>> support
>>>>>>>> it, again in a transparent manner (we also intend to contribute
>>>> this
>>>>>>>> stack in a second / parallel step).
>>>>>>>> In the objective of contributing to the Avro stack with this
>>>>>>>> compatibility mode for unions, I have some questions about
>>>>>>> procedure:
>>>>>>>> 1. How can I achieve the contribution proposal? Should I
>>>>>>>> provide a patch in JIRA and dive into the details right there?
>>>>>>>> 2. The base version of this evolution is 1.7.7, is it eligible
>>>>>>>> contribution evaluation anyway?
>>>>>>>> Thanks in advance, looking forward to hearing from you and
>> you
>>>>>>> more
>>>>>>>> details.
>>>>>>>> Kind Regards,
>>>>>>>> --
>>>>>>>> *Yacine Benabderrahmane*
>>>>>>>> Architect
>>>>>>>> *OCTO Technology*
>>>>>>>> <http://www.octo.com>
>>>>>>>> -----------------------------------------------
>>>>>>>> Tel : +33 6 10 88 25 98
>>>>>>>> 50 avenue des Champs Elysées
>>>>>>>> 75008 PARIS
>>>>>>>> www.octo.com
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
> -- 
> Ryan Blue
> Software Engineer
> Netflix

View raw message