avro-user mailing list archives

From Jeremy Kahn <troc...@trochee.net>
Subject Re: Issue writing union in avro?
Date Tue, 09 Apr 2013 16:16:56 GMT
I will open a JIRA ticket to request a Python StrictJSONEncoder that
produces these type-hints. Probably a StrictJSONDecoder needs to be there
too -- at any rate, the StrictJSONDecoder would be nice so that Python
could consume JSON-encoded output from Java et al.

A StrictJSON{Decoder,Encoder} might provide a (high-IO) workaround to
Jeremy Karn's problem about how to consume avro over a non-seekable
filehandle (e.g., standard in).

As I understand it:

The Python avro library doesn't have a JSON encoder at all: it has a binary
decoder, which deserializes to Python generics. These generics conveniently
serialize to JSON using the json.dumps core library call, but **json.dumps
on a python object is NOT the same as json encoding Avro**.

There are actually two slightly different understandings of "encoded in
JSON" built into the discussion around Python:

  (a) json.dumps(obj) on the Python generic

  (b) "strict" json encoding, which would require knowing when the schema
expects a union and inserting the extra key-name type hint.
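To make the difference between (a) and (b) concrete, here is a minimal sketch in plain Python. It does not use the avro package; strict_encode, matches and branch_name are hypothetical helpers written for this illustration, and the User schema is assumed to be the one from the Avro getting-started example (the schema in the gist is not reproduced in this thread). It only covers records, unions and a few primitives, and it labels named types by their short name.

```python
import json

# Python types accepted for a few primitive Avro types (simplified; bytes,
# enums, arrays, maps and fixed are deliberately left out of this sketch).
PRIMITIVES = {"null": type(None), "boolean": bool, "float": float,
              "double": float, "string": str}


def branch_name(schema):
    """Label for a union branch: the type name, or the record's name."""
    if isinstance(schema, dict):
        return schema.get("name", schema["type"])
    return schema


def matches(schema, datum):
    """True if datum could be an instance of this (non-union) schema."""
    if isinstance(schema, dict):
        return schema["type"] == "record" and isinstance(datum, dict)
    if schema in ("int", "long"):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(datum, int) and not isinstance(datum, bool)
    return isinstance(datum, PRIMITIVES[schema])


def strict_encode(schema, datum):
    """Return a json.dumps-ready object with union type hints inserted."""
    if isinstance(schema, list):  # a union: wrap the value in {branch: value}
        for branch in schema:
            if matches(branch, datum):
                if branch == "null":
                    return None  # null is written bare, without a label
                return {branch_name(branch): strict_encode(branch, datum)}
        raise ValueError("datum matches no branch of the union")
    if isinstance(schema, dict) and schema["type"] == "record":
        return {f["name"]: strict_encode(f["type"], datum[f["name"]])
                for f in schema["fields"]}
    return datum


# Assumed schema, modelled on the Avro getting-started User record.
USER_SCHEMA = {
    "type": "record", "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]},
    ],
}

user = {"name": "Alyssa", "favorite_number": 256, "favorite_color": "blue"}

plain = json.dumps(user)                                # understanding (a)
strict = json.dumps(strict_encode(USER_SCHEMA, user))   # understanding (b)
print(plain)
print(strict)
```

The second line printed has the same shape as the string Jonathan found to work with the Java JsonDecoder; the first is what json.dumps produces on its own.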

(b) is required to preserve type information reliably in JSON, but type
information for union members may *always* be lost in a round trip to
Python generics. If something is encoded as a 'long' when the schema reads
['int', 'long'], the Python code does not guarantee that an avro>python>avro
round trip will encode it as 'long' again.
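A small sketch of why the branch can be lost, assuming a writer that picks the first union branch whose Python type fits the value. first_matching_branch is a hypothetical helper for illustration only; the real avro Python library may choose branches differently, but the underlying problem is the same: a decoded Python int carries no record of which branch it came from.

```python
def first_matching_branch(union, datum):
    """Pick the first union branch whose Python type fits the datum.

    Hypothetical helper: a stand-in for whatever rule a writer uses when
    the generic value itself does not say which branch it belongs to.
    """
    for branch in union:
        if branch in ("int", "long"):
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(datum, int) and not isinstance(datum, bool):
                return branch
        elif branch == "string" and isinstance(datum, str):
            return branch
        elif branch == "null" and datum is None:
            return branch
    raise ValueError("datum matches no branch of the union")


union = ["int", "long"]

# Suppose this value was originally written under the 'long' branch.
# After decoding it is just a Python int; the branch label is gone.
decoded = 5

chosen = first_matching_branch(union, decoded)
print(chosen)  # the writer falls back to the first branch that fits: 'int'
```

Keeping the hint alongside the value, as the strict JSON form does with {"long": 5}, is what lets a reader recover the original branch.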

--Jeremy


On Apr 7, 2013 3:47 AM, "Jonathan Coveney" <jcoveney@gmail.com> wrote:

> Thanks, that is very helpful. It actually makes complete sense (note the
> other email where I was wondering exactly  how avro dealt with unions of
> similar types), I guess what threw me off is that the python implementation
> worked fine.
>
> Thanks again
> Jon
>
>
> 2013/4/7 Scott Carey <scottcarey@apache.org>
>
>> It is well documented in the specification:
>> http://avro.apache.org/docs/current/spec.html#json_encoding
>>
>> I know others have overridden this behavior by extending GenericData
>> and/or the JsonDecoder/Encoder.  It wouldn't conform to the Avro
>> Specification JSON, but you can extend avro to do what you need it to.
>>
>> The reason for this encoding is to make sure that round-tripping data
>> from binary to json and back results in the same data.  Additionally,
>> unions can be more complicated and contain multiple records each with
>> different names.  Disambiguating the value requires more information since
>> several Avro data types map to the same JSON data type.  If the schema is a
>> union of bytes and string, is "hello" a string, or byte literal?  If it is
>> a union of a map and a record, is {"state":"CA", "city":"Pittsburgh"}  a
>> record with two string fields, or a map?   There are other approaches, and
>> for some users perfect transmission of types is not critical.  Generally
>> speaking, if you want to output Avro data as JSON and consume as JSON, the
>> extra data is not helpful.  If you want to read it back in as Avro, you're
>> going to need the info to know which branch of the union to take.
>>
>> On 4/6/13 6:49 PM, "Jonathan Coveney" <jcoveney@gmail.com> wrote:
>>
>> Err, it's the output format that deserializes the json and then writes it
>> in the binary format, not the input format. But either way the general flow
>> is the same.
>>
>> As a general aside, is it the case that the java case is correct in that
>> when writing a union it should be {"string": "hello"} or whatnot? Seems
>> like we should probably add that to the documentation if it is a
>> requirement.
>>
>>
>> 2013/4/7 Jonathan Coveney <jcoveney@gmail.com>
>>
>>> Scott,
>>>
>>> Thanks for the input. The use case is that a number of our batch
>>> processes are built on python streaming. Currently, the reducer will output
>>> a json string as a value, and then the input format will deserialize the
>>> json, and then write it in binary format.
>>>
>>> Given that our use of python streaming isn't going away, any suggestions
>>> on how to make this better? Is there a better way to go from json string ->
>>> writing binary avro data?
>>>
>>> Thanks again
>>> Jon
>>>
>>>
>>> 2013/4/6 Scott Carey <scottcarey@apache.org>
>>>
>>>> This is due to using the JSON encoding for avro and not the binary
>>>> encoding.  It would appear that the Python version is a little bit lax on
>>>> the spec.  Some have built variations of the JSON encoding that do not
>>>> label the union, but there are drawbacks to this too, as the type can be
>>>> ambiguous in a very large number of cases without a label.
>>>>
>>>> Why are you using the JSON encoding for Avro?  The primary purpose of
>>>> the JSON serialization form as it is now is for transforming the binary to
>>>> human readable form.
>>>> Instead of building your GenericRecord from a JSON string, try using
>>>> GenericRecordBuilder.
>>>>
>>>> -Scott
>>>>
>>>> On 4/5/13 4:59 AM, "Jonathan Coveney" <jcoveney@gmail.com> wrote:
>>>>
>>>> Ok, I figured out the issue:
>>>>
>>>> If you make string c the following:
>>>> String c = "{\"name\": \"Alyssa\", \"favorite_number\": {\"int\": 256}, \"favorite_color\": {\"string\": \"blue\"}}";
>>>>
>>>> Then this works.
>>>>
>>>> This represents a divergence between the python and the Java
>>>> implementation... the above does not work in Python, but it does work in
>>>> Java. And of course, vice versa.
>>>>
>>>> I think I know how to fix this (and can file a bug with my reproduction
>>>> and the fix), but I'm not sure which one is the expected case? Which
>>>> implementation is wrong?
>>>>
>>>> Thanks
>>>>
>>>>
>>>> 2013/4/5 Jonathan Coveney <jcoveney@gmail.com>
>>>>
>>>>> Correction: the issue is when reading the string according to the avro
>>>>> schema, not on writing. it fails before I get a chance to write :)
>>>>>
>>>>>
>>>>> 2013/4/5 Jonathan Coveney <jcoveney@gmail.com>
>>>>>
>>>>>> I implemented essentially the Java avro example but using the
>>>>>> GenericDatumWriter and GenericDatumReader and hit an issue.
>>>>>>
>>>>>> https://gist.github.com/jcoveney/5317904
>>>>>>
>>>>>> This is the error:
>>>>>> Exception in thread "main" java.lang.RuntimeException: org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
>>>>>>     at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:45)
>>>>>> Caused by: org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
>>>>>>     at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697)
>>>>>>     at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441)
>>>>>>     at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
>>>>>>     at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>>>>>>     at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
>>>>>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
>>>>>>     at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>>>>>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>>>>>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>>>>>>     at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:38)
>>>>>>
>>>>>> Am I doing something wrong? Is this a bug? I'm digging in now but am
>>>>>> curious if anyone has seen this before?
>>>>>>
>>>>>> I get the feeling I am working with Avro in a way that most people do
>>>>>> not :)
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
