avro-dev mailing list archives

From Spencer Nelson...@spencerwnelson.com>
Subject Re: What are the rules when encoding a union of records and/or maps?
Date Thu, 11 Mar 2021 20:22:48 GMT
It turns out that the Python implementation takes the last path that
matches.

Agreed that it's deterministic within a language, but it might round-trip
inconsistently.

For example, suppose Java takes the first path that matches, and Python
takes the last path that matches. Then, if I (1) serialize with Java, (2)
deserialize with Python, and (3) reserialize with Python, the encoded
bytes after step 3 will differ from those after step 1. Each implementation
can still read the encoded data, but its binary representation has changed.
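
To make that concrete, here is a rough sketch of the two policies applied to
the map/record union from my original message. It is a hand-rolled toy
encoder, not the real avro library, and the "first" and "last" policies are
just the hypothetical behaviors from the example above. All function names
are made up for illustration.

```python
# Toy encoder for the union [map<int>, Record{field: int}].
# Assumption: "first" and "last" stand in for two hypothetical
# implementations that resolve an ambiguous union differently.

SCHEMA = [
    {"type": "map", "values": "int"},
    {"type": "record", "name": "Record",
     "fields": [{"name": "field", "type": "int"}]},
]


def zigzag_varint(n: int) -> bytes:
    """Encode an integer as an Avro zig-zag variable-length value."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def matches(branch: dict, datum) -> bool:
    """Return True if a Python dict could plausibly be this branch."""
    if not isinstance(datum, dict):
        return False
    if not all(isinstance(k, str) and isinstance(v, int)
               for k, v in datum.items()):
        return False
    if branch["type"] == "map":
        return True
    if branch["type"] == "record":
        return set(datum) == {f["name"] for f in branch["fields"]}
    return False


def encode_branch(branch: dict, datum: dict) -> bytes:
    """Encode the value itself, without the union index."""
    if branch["type"] == "map":
        out = bytearray()
        if datum:
            # One block: item count, then key/value pairs.
            out += zigzag_varint(len(datum))
            for key, value in datum.items():
                key_bytes = key.encode("utf-8")
                out += zigzag_varint(len(key_bytes)) + key_bytes
                out += zigzag_varint(value)
        out += zigzag_varint(0)  # zero-count block ends the map
        return bytes(out)
    # Record: field values in schema order, no names on the wire.
    return b"".join(zigzag_varint(datum[f["name"]])
                    for f in branch["fields"])


def encode_union(schema: list, datum: dict, pick: str) -> bytes:
    """Encode datum as union index + value, using the given policy."""
    candidates = [i for i, b in enumerate(schema) if matches(b, datum)]
    index = candidates[0] if pick == "first" else candidates[-1]
    return zigzag_varint(index) + encode_branch(schema[index], datum)


first = encode_union(SCHEMA, {"field": 1}, "first")  # chooses the map
last = encode_union(SCHEMA, {"field": 1}, "last")    # chooses the record
print(first.hex(), last.hex())
```

The same in-memory dict comes out as two different byte strings depending
only on which matching branch the encoder happens to pick.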

Maybe that's okay. It's a little unfortunate for writing tests, and
violates some expectations - you'd think that decoding and re-encoding
data, without changing anything in it, would not change its bytes on disk.

On Fri, Mar 5, 2021 at 8:53 AM Ryan Blue <rblue@netflix.com.invalid> wrote:

> I think encoding that value would produce the map. I
> would expect that because I'm assuming Python uses the first path that
> appears to match. When it's ambiguous which way an in-memory representation
> maps to a schema, it's up to the implementation to choose.
>
> Whatever Python chooses, the actual encoding is deterministic. Either the
> map or the record will be chosen and the bytes produced will always
> deserialize to that representation if you read it in another language
> implementation.
>
> On Thu, Mar 4, 2021 at 5:30 PM Spencer Nelson <s@spencerwnelson.com>
> wrote:
>
> > Suppose a schema like this - a union of a map and a record:
> >
> > [
> >     {"type": "map", "values": "int"},
> >     {"type": "record", "name": "Record",
> >      "fields": [{"name": "field", "type": "int"}]}
> > ]
> >
> > In Python, unserialized maps and records are both represented as
> > dictionaries. So, if an Avro Python library were asked to encode this
> > message:
> >
> >     {"field": 1}
> >
> > What should it do? Should it describe the value as the map type, or
> > the record type, when encoding the union?
> >
> > Similarly, I wonder about cases where multiple records are in a union.
> > I think it's easy to imagine the ambiguous cases without spelling it
> > all out.
> >
> > Maybe this ambiguity is specific to the Python implementation, I'm not
> > sure.
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
