Subject: Re: Issue writing union in avro?
From: Scott Carey
Date: Sun, 07 Apr 2013 00:16:52 -0700
To: user@avro.apache.org

It is well documented in the specification:
http://avro.apache.org/docs/current/spec.html#json_encoding

I know others have overridden this behavior by extending GenericData and/or the JsonDecoder/Encoder. It wouldn't conform to the Avro specification's JSON encoding, but you can extend Avro to do what you need it to.

The reason for this encoding is to make sure that round-tripping data from binary to JSON and back results in the same data. Additionally, unions can be more complicated and contain multiple records, each with different names. Disambiguating the value requires more information, since several Avro data types map to the same JSON data type. If the schema is a union of bytes and string, is "hello" a string or a byte literal? If it is a union of a map and a record, is {"state":"CA", "city":"Pittsburgh"} a record with two string fields, or a map? There are other approaches, and for some users perfect transmission of types is not critical. Generally speaking, if you want to output Avro data as JSON and consume it as JSON, the extra data is not helpful. If you want to read it back in as Avro, you're going to need that info to know which branch of the union to take.
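As a rough illustration (untested, and assuming a user.avsc-style schema like the one in the quoted example below; the class name is just a placeholder), building the record with GenericRecordBuilder and writing it through the JsonEncoder produces the labeled union form:

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

public class LabeledUnionExample {
  public static void main(String[] args) throws Exception {
    // Assumed schema, in the style of the user.avsc example discussed below:
    // two nullable union fields.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"]},"
        + "{\"name\":\"favorite_color\",\"type\":[\"string\",\"null\"]}]}");

    // Build the record directly instead of parsing a JSON string.
    GenericRecord user = new GenericRecordBuilder(schema)
        .set("name", "Alyssa")
        .set("favorite_number", 256)
        .set("favorite_color", "blue")
        .build();

    // Write it with the JSON encoder; each non-null union branch is labeled.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
    new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
    encoder.flush();

    // Prints something like:
    // {"name":"Alyssa","favorite_number":{"int":256},"favorite_color":{"string":"blue"}}
    System.out.println(out.toString());
  }
}

GenericRecordBuilder checks the values against the schema and fills in field defaults, which also avoids hand-building JSON strings.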
On 4/6/13 6:49 PM, "Jonathan Coveney" wrote:

> Err, it's the output format that deserializes the json and then writes it in
> the binary format, not the input format. But either way the general flow is
> the same.
>
> As a general aside, is it the case that the java case is correct in that when
> writing a union it should be {"string": "hello"} or whatnot? Seems like we
> should probably add that to the documentation if it is a requirement.
>
>
> 2013/4/7 Jonathan Coveney
>> Scott,
>>
>> Thanks for the input. The use case is that a number of our batch processes
>> are built on python streaming. Currently, the reducer will output a json
>> string as a value, and then the input format will deserialize the json, and
>> then write it in binary format.
>>
>> Given that our use of python streaming isn't going away, any suggestions on
>> how to make this better? Is there a better way to go from json string ->
>> writing binary avro data?
>>
>> Thanks again
>> Jon
>>
>>
>> 2013/4/6 Scott Carey
>>> This is due to using the JSON encoding for avro and not the binary encoding.
>>> It would appear that the Python version is a little bit lax on the spec.
>>> Some have built variations of the JSON encoding that do not label the union,
>>> but there are drawbacks to this too, as the type can be ambiguous in a very
>>> large number of cases without a label.
>>>
>>> Why are you using the JSON encoding for Avro?  The primary purpose of the
>>> JSON serialization form as it is now is for transforming the binary to human
>>> readable form.
>>> Instead of building your GenericRecord from a JSON string, try using
>>> GenericRecordBuilder.
>>>
>>> -Scott
>>>
>>> On 4/5/13 4:59 AM, "Jonathan Coveney" wrote:
>>>
>>>> Ok, I figured out the issue:
>>>>
>>>> If you make string c the following:
>>>> String c = "{\"name\": \"Alyssa\", \"favorite_number\": {\"int\": 256}, \"favorite_color\": {\"string\": \"blue\"}}";
>>>>
>>>> Then this works.
>>>>
>>>> This represents a divergence between the python and the Java
>>>> implementation... the above does not work in Python, but it does work in
>>>> Java. And of course, vice versa.
>>>>
>>>> I think I know how to fix this (and can file a bug with my reproduction and
>>>> the fix), but I'm not sure which one is the expected case? Which
>>>> implementation is wrong?
>>>>
>>>> Thanks
>>>>
>>>>
>>>> 2013/4/5 Jonathan Coveney
>>>>> Correction: the issue is when reading the string according to the avro
>>>>> schema, not on writing. it fails before I get a chance to write :)
>>>>>
>>>>>
>>>>> 2013/4/5 Jonathan Coveney
>>>>>> I implemented essentially the Java avro example but using the
>>>>>> GenericDatumWriter and GenericDatumReader and hit an issue.
>>>>>>
>>>>>> https://gist.github.com/jcoveney/5317904
>>>>>>
>>>>>> This is the error:
>>>>>> Exception in thread "main" java.lang.RuntimeException:
>>>>>> org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
>>>>>>     at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:45)
>>>>>> Caused by: org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
>>>>>>     at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697)
>>>>>>     at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441)
>>>>>>     at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
>>>>>>     at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>>>>>>     at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
>>>>>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
>>>>>>     at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>>>>>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>>>>>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>>>>>>     at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:38)
>>>>>>
>>>>>> Am I doing something wrong? Is this a bug? I'm digging in now but am
>>>>>> curious if anyone has seen this before?
>>>>>>
>>>>>> I get the feeling I am working with Avro in a way that most people do not :)
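For reference, a minimal, untested sketch of the json-string-to-binary path discussed above (the class name and the user.avsc-style schema are placeholders; the JSON input must be the spec-compliant, union-labeled form):

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonDecoder;

public class JsonToBinarySketch {
  public static void main(String[] args) throws Exception {
    // Assumed schema in the style of the user.avsc example from the thread.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"]},"
        + "{\"name\":\"favorite_color\",\"type\":[\"string\",\"null\"]}]}");

    // Spec-compliant Avro JSON: each non-null union branch is labeled with its type.
    String json = "{\"name\": \"Alyssa\", \"favorite_number\": {\"int\": 256}, "
        + "\"favorite_color\": {\"string\": \"blue\"}}";

    // Decode the JSON text into a GenericRecord using the schema.
    JsonDecoder decoder = DecoderFactory.get().jsonDecoder(schema, json);
    GenericRecord record = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);

    // Re-encode the same record in the binary format.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
    encoder.flush();

    System.out.println(out.size() + " bytes of binary Avro");
  }
}

The JsonDecoder is what raises the "Expected start-union" error shown above when the union values in the input are not labeled.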