avro-user mailing list archives

From "Grant Overby (groverby)" <grove...@cisco.com>
Subject Re: Union resolution in dynamic languages
Date Thu, 05 Jun 2014 13:46:45 GMT
Disallowing multiple named types within a union would break our use cases.

We have a similar problem. With two record types in a union, the Python driver doesn’t choose

We solved this problem by adding a pseudo-reserved key to the dict to indicate which named
type to use. I started the process of open sourcing that patch a few days ago. It’s definitely
a hack, but I’m hoping the community will accept it.
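
To illustrate the idea of a hint key, here is a minimal sketch, not the patch itself. The key
name `__avro_type__` and both helper functions are hypothetical, and the union is modeled as a
plain list of parsed JSON schema branches rather than the Python library's schema objects.

```python
TYPE_HINT_KEY = "__avro_type__"  # hypothetical key name, not taken from the patch


def index_named_branches(union_schema):
    """Map each named branch (record, enum, fixed) to its position in the union."""
    return {
        branch["name"]: position
        for position, branch in enumerate(union_schema)
        if isinstance(branch, dict) and "name" in branch
    }


def pick_branch(named_index, datum):
    """Return (branch position, datum without the hint), or None if no hint is present."""
    if not isinstance(datum, dict) or TYPE_HINT_KEY not in datum:
        return None  # caller falls back to whatever resolution it already uses
    payload = dict(datum)  # copy so the caller's dict is left untouched
    hint = payload.pop(TYPE_HINT_KEY)
    return named_index[hint], payload  # raises KeyError if the hint names no branch


UNION = [
    "null",
    {"type": "record", "name": "Manager",
     "fields": [{"name": "name", "type": "string"}]},
    {"type": "record", "name": "NonManager",
     "fields": [{"name": "name", "type": "string"}]},
]
NAMED_INDEX = index_named_branches(UNION)
RESULT = pick_branch(NAMED_INDEX, {TYPE_HINT_KEY: "NonManager", "name": "Ann"})
```

Both records here have identical fields, so only the hint can tell them apart. Building the index
once per union makes each later lookup a single dict access.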

Our patch doesn’t change the time complexity. From a brief glance, choosing within the
union seems to typically be O(n), as the recursion short-circuits. For named types, the complexity
could be O(1). O(1) for non-named types seems achievable too. How many projects
are impacted by this ‘wasted’ complexity? Simpler code might be better than faster code.
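
As a rough sketch of what O(1) selection for non-named types could look like, assuming Python 3
and a union modeled as a plain list of branch names: the type table and both functions below are
illustrative assumptions, not code from the Avro Python library. Complex branches (array, map,
record) are simply skipped.

```python
# Assumed mapping from Avro primitive names to the Python type that represents them.
AVRO_TO_PYTHON = {
    "null": type(None), "boolean": bool, "int": int, "long": int,
    "float": float, "double": float, "string": str, "bytes": bytes,
}


def index_primitive_branches(union_schema):
    """Map a Python type to the position of the first primitive branch that accepts it."""
    index = {}
    for position, branch in enumerate(union_schema):
        py_type = AVRO_TO_PYTHON.get(branch) if isinstance(branch, str) else None
        if py_type is not None:
            index.setdefault(py_type, position)  # first matching branch wins
    return index


def pick_primitive_branch(index, datum):
    """Exact-type lookup, so True is not mistaken for an int. None means no match."""
    return index.get(type(datum))


INDEX = index_primitive_branches(["null", "string", "long"])
```

A union holding both "int" and "long" (or "float" and "double") still needs a range check that
this sketch leaves out; it just takes the first branch listed.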


Grant Overby
Software Engineer
Mobile: 865 724 4910


From: Wai Yip Tung <wy@tungwaiyip.info>
Reply-To: user@avro.apache.org
Date: Wednesday, June 4, 2014 at 9:34 PM
To: user@avro.apache.org
Subject: Re: Union resolution in dynamic languages

I am also asking about this in the context of building an optimized encoder. For this implementation,
the resolution would be much simpler if we limited unions to not support two records, similar
to how the spec does not allow two array or two map types. I wonder if this limit would break any
significant use case.

Wai Yip
Wai Yip Tung <wy@tungwaiyip.info>
Wednesday, June 04, 2014 4:40 PM
For encoding data of union type, the Avro specification does not say much about which one of the
types in the union is used. So far I have mostly been using unions so that I can write null or another
simple type. In these cases, it is fairly easy for the encoder to distinguish null from
other types.

However, a union can also contain any named types, so it can hold two records, say a Manager
record and a NonManager record. I think with strongly typed languages, the suitable type in
the union can be selected by introspection. But in dynamic languages, these might just be
represented as maps without any notion of type. In some cases, we may find that the object
has all the attributes of a NonManager but not of a Manager, so we can conclude that NonManager
is the proper schema to use. But this can get complicated with nested data structures, where
the attribute that disambiguates things appears at a deeper level. Or you can think of valid
scenarios where inspecting the content of the object cannot unambiguously resolve the union branch.
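
To make the ambiguity concrete, here is a small sketch. It is only an illustration: the schemas,
the `reports` field and the function are made up for this example, the check compares top-level
field names only, and it is not the validation that the Python implementation performs.

```python
def matching_branches(union_schema, datum):
    """Names of the record branches whose field names equal the keys of datum."""
    keys = set(datum)
    return [
        branch["name"]
        for branch in union_schema
        if isinstance(branch, dict)
        and branch.get("type") == "record"
        and {field["name"] for field in branch["fields"]} == keys
    ]


MANAGER = {"type": "record", "name": "Manager", "fields": [
    {"name": "name", "type": "string"},
    {"name": "reports", "type": {"type": "array", "items": "string"}},
]}
NON_MANAGER = {"type": "record", "name": "NonManager", "fields": [
    {"name": "name", "type": "string"},
]}
# A hypothetical third record with the same shape as NonManager.
CONTRACTOR = {"type": "record", "name": "Contractor", "fields": [
    {"name": "name", "type": "string"},
]}

CLEAR = matching_branches([MANAGER, NON_MANAGER], {"name": "Ann"})
AMBIGUOUS = matching_branches([MANAGER, NON_MANAGER, CONTRACTOR], {"name": "Ann"})
```

With the first union the dict matches exactly one branch. With the second it matches both
NonManager and Contractor, and nothing in the data says which one was meant.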

I notice that the Python implementation uses a two-pass recursive validation, possibly for the
purpose of resolving the union choice.

I am wondering whether much consideration has been given to potentially complex, indirectly nested
union types that might be difficult to resolve, and thus add complexity to the implementation
of the encoders. Are there use cases in practice that involve complex union decisions?

Wai Yip
