avro-user mailing list archives

From Jonathan Coveney <jcove...@gmail.com>
Subject Re: Issue writing union in avro?
Date Sun, 07 Apr 2013 10:47:15 GMT
Thanks, that is very helpful. It actually makes complete sense (note the
other email where I was wondering exactly how Avro dealt with unions of
similar types). I guess what threw me off is that the Python implementation
worked fine.

Thanks again
Jon


2013/4/7 Scott Carey <scottcarey@apache.org>

> It is well documented in the specification:
> http://avro.apache.org/docs/current/spec.html#json_encoding
>
> I know others have overridden this behavior by extending GenericData
> and/or the JsonDecoder/Encoder.  The result wouldn't conform to the JSON
> encoding in the Avro specification, but you can extend Avro to do what you
> need it to.
>
> The reason for this encoding is to make sure that round-tripping data from
> binary to json and back results in the same data.  Additionally, unions can
> be more complicated and contain multiple records each with different names.
> Disambiguating the value requires more information since several Avro data
> types map to the same JSON data type.  If the schema is a union of bytes
> and string, is "hello" a string or a byte literal?  If it is a union of a
> map and a record, is {"state":"CA", "city":"Pittsburgh"} a record with two
> string fields, or a map?  There are other approaches, and for some users
> perfect transmission of types is not critical.  Generally speaking, if you
> want to output Avro data as JSON and consume as JSON, the extra data is not
> helpful.  If you want to read it back in as Avro, you're going to need the
> info to know which branch of the union to take.
>
> On 4/6/13 6:49 PM, "Jonathan Coveney" <jcoveney@gmail.com> wrote:
>
> Err, it's the output format that deserializes the JSON and then writes it
> in the binary format, not the input format. But either way the general flow
> is the same.
>
> As a general aside, is the Java behavior the correct one, in that a union
> value should be written as {"string": "hello"} or whatnot? Seems like we
> should probably add that to the documentation if it is a requirement.
>
>
> 2013/4/7 Jonathan Coveney <jcoveney@gmail.com>
>
>> Scott,
>>
>> Thanks for the input. The use case is that a number of our batch
>> processes are built on Python streaming. Currently, the reducer will output
>> a JSON string as a value, and then the input format will deserialize the
>> JSON, and then write it in binary format.
>>
>> Given that our use of Python streaming isn't going away, any suggestions
>> on how to make this better? Is there a better way to go from a JSON string
>> to writing binary Avro data?
>>
>> Thanks again
>> Jon
>>
>>
>> 2013/4/6 Scott Carey <scottcarey@apache.org>
>>
>>> This is due to using the JSON encoding for Avro and not the binary
>>> encoding.  It would appear that the Python version is a little bit lax on
>>> the spec.  Some have built variations of the JSON encoding that do not
>>> label the union, but there are drawbacks to this too, as the type can be
>>> ambiguous in a very large number of cases without a label.
>>>
>>> Why are you using the JSON encoding for Avro?  The primary purpose of
>>> the JSON serialization form as it is now is for transforming the binary to
>>> human readable form.
>>> Instead of building your GenericRecord from a JSON string, try using
>>> GenericRecordBuilder.
>>>
>>> -Scott
>>>
>>> On 4/5/13 4:59 AM, "Jonathan Coveney" <jcoveney@gmail.com> wrote:
>>>
>>> Ok, I figured out the issue:
>>>
>>> If you make the string c the following:
>>> String c = "{\"name\": \"Alyssa\", \"favorite_number\": {\"int\": 256}, "
>>>     + "\"favorite_color\": {\"string\": \"blue\"}}";
>>>
>>> Then this works.
>>>
>>> This represents a divergence between the python and the Java
>>> implementation... the above does not work in Python, but it does work in
>>> Java. And of course, vice versa.
>>>
>>> I think I know how to fix this (and can file a bug with my reproduction
>>> and the fix), but I'm not sure which one is the expected case? Which
>>> implementation is wrong?
>>>
>>> Thanks
>>>
>>>
>>> 2013/4/5 Jonathan Coveney <jcoveney@gmail.com>
>>>
>>>> Correction: the issue is when reading the string according to the Avro
>>>> schema, not on writing. It fails before I get a chance to write :)
>>>>
>>>>
>>>> 2013/4/5 Jonathan Coveney <jcoveney@gmail.com>
>>>>
>>>>> I implemented essentially the Java Avro example but using the
>>>>> GenericDatumWriter and GenericDatumReader and hit an issue.
>>>>>
>>>>> https://gist.github.com/jcoveney/5317904
>>>>>
>>>>> This is the error:
>>>>> Exception in thread "main" java.lang.RuntimeException: org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
>>>>>     at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:45)
>>>>> Caused by: org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
>>>>>     at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697)
>>>>>     at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441)
>>>>>     at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
>>>>>     at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>>>>>     at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
>>>>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
>>>>>     at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>>>>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>>>>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>>>>>     at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:38)
>>>>>
>>>>> Am I doing something wrong? Is this a bug? I'm digging in now but am
>>>>> curious if anyone has seen this before?
>>>>>
>>>>> I get the feeling I am working with Avro in a way that most people do
>>>>> not :)
>>>>>
>>>>>
>>>>
>>>
>>
>
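
For readers following the thread, the labeling rule Scott describes can be
sketched in plain Java without the Avro library. This is only an
illustration: UnionLabeler, candidates and label are hypothetical names, not
Avro APIs, and the sketch handles primitive branches only. It also assumes the
unions in the example schema are ["int", "null"] and ["string", "null"], which
the thread itself does not show.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Avro library.
class UnionLabeler {

    // Avro primitive type names that a plain JSON value could stand for.
    // A JSON string is the interesting case: it could be "string" or "bytes".
    static List<String> candidates(Object value) {
        if (value == null) return Arrays.asList("null");
        if (value instanceof Boolean) return Arrays.asList("boolean");
        if (value instanceof Integer) return Arrays.asList("int", "long");
        if (value instanceof Long) return Arrays.asList("long");
        if (value instanceof Double) return Arrays.asList("float", "double");
        if (value instanceof String) return Arrays.asList("string", "bytes");
        return Collections.emptyList();
    }

    // Wraps a plain value the way the Avro JSON encoding writes a union:
    // null stays a bare null, anything else becomes {"<branch>": value}.
    // Throws if the plain value fits no branch or more than one branch.
    static Object label(List<String> branches, Object value) {
        List<String> matches = new ArrayList<>();
        for (String branch : branches) {
            if (candidates(value).contains(branch)) {
                matches.add(branch);
            }
        }
        if (matches.isEmpty()) {
            throw new IllegalArgumentException(
                "no branch of " + branches + " fits " + value);
        }
        if (matches.size() > 1) {
            throw new IllegalArgumentException(
                "ambiguous: " + value + " fits " + matches);
        }
        String branch = matches.get(0);
        if (branch.equals("null")) {
            return null;
        }
        return Collections.singletonMap(branch, value);
    }
}
```

Under these assumptions, labeling 256 against ["int", "null"] gives a
single-entry map from "int" to 256, which corresponds to the {"int": 256} form
that worked in Java above. Labeling "hello" against a union of bytes and string
is rejected, because the plain JSON value alone does not say which branch was
meant.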
