avro-user mailing list archives

From Tim Upthegrove <tim.upthegr...@gmail.com>
Subject Re: Differing behavior of Java and Python implementations
Date Tue, 14 Feb 2017 23:20:02 GMT
Hi Brenden,

If your schema uses unions (and since you have nullable fields, it must), then
the JSON format that the Avro JSON encoders produce and decoders expect (at
least in the Java and C++ implementations) explicitly encodes the union branch
type in the data. The Python Avro library doesn't ship a JSON encoder or
decoder of its own; in your code the standard json module does the parsing, so
no union branch is expected there. That is most likely why everything works in
Python but deserialization fails in Java.
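
For example, assuming significant_place and version are both declared as the
union ["null", "string"] (field names borrowed from your sample record below),
the Java JsonDecoder would expect input shaped like this:

{"significant_place": null, "version": {"string": "inferencecore-1.123.0"}}

A null branch is written as a bare null, but any other branch must be wrapped
in an object naming its type, which is exactly what the "Expected start-union.
Got VALUE_STRING" error is complaining about.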

Thanks,
Tim

On Tue, Feb 14, 2017 at 6:09 PM, Brenden Brown <brenden.brown@placed.com>
wrote:

> We're exploring replacing JSON with Avro as our data storage format. Our
> schema is fairly messy, deeply nested, and has several nullable fields. I'm
> writing some code to run a MapReduce step to convert from JSON to Avro
> format.
>
> I managed to get a working prototype on a simple schema. Now that I'm using
> the real schema, I'm running into a case where the Python prototype converts
> a record successfully while the Java prototype throws
> org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_STRING.
>
> Java code:
>
> import java.io.IOException;
> import java.io.InputStream;
> import java.nio.charset.StandardCharsets;
> import java.nio.file.Files;
> import java.nio.file.Paths;
>
> import org.apache.avro.Schema;
> import org.apache.avro.io.DecoderFactory;
> import org.apache.avro.io.JsonDecoder;
> import org.apache.avro.specific.SpecificDatumReader;
> import org.apache.commons.io.IOUtils;
>
> public void driver() throws Exception {
>     // Read the whole JSON file into a string.
>     byte[] encoded = Files.readAllBytes(Paths.get("json_file"));
>     String string = new String(encoded, StandardCharsets.UTF_8);
>     Cluster c = deserializer(string, Cluster.getClassSchema());
> }
>
> public Cluster deserializer(String value, Schema schema) throws IOException {
>     // Decode the JSON against the schema of the generated Cluster class.
>     InputStream stream = IOUtils.toInputStream(value);
>     SpecificDatumReader<Cluster> reader = new SpecificDatumReader<>(schema);
>     JsonDecoder decoder = DecoderFactory.get().jsonDecoder(schema, stream);
>     return reader.read(null, decoder);
> }
>
> Python code:
>
> import sys
> import json
> import avro.schema
> from avro.datafile import DataFileReader, DataFileWriter
> from avro.io import DatumReader, DatumWriter
>
> # Parse the schema and open an Avro container file for writing.
> schema = avro.schema.parse(open("Cluster.avsc", "rb").read())
> rec_writer = DatumWriter(schema)
> df_writer = DataFileWriter(open("users.avro", "wb"), rec_writer, schema)
>
> for line in sys.stdin:
>     cluster_dict = json.loads(line)
>     df_writer.append(cluster_dict)
> df_writer.close()
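>
> (For what it's worth, this is how I read the file back with DataFileReader
> to check that the records were written:)
>
> # Read the container file back to verify the conversion.
> df_reader = DataFileReader(open("users.avro", "rb"), DatumReader())
> for record in df_reader:
>     print(record)
> df_reader.close()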
>
> The input JSON is untagged. Here's a representative subset:
> {
>   "start_time": 1486000000000,
>   "total_place_count": 0,
>   "appliedDemoMdl": false,
>   "longitude": -99.990911,
>   "significant_place": null,
>   "version": "inferencecore-1.123.0",
>   "end_time": 1486070000000,
>   "latitude": 11.111182,
> }
>
> My main question is: why are my two implementations behaving differently? Is
> what I'm trying to do simply not possible without writing my own object
> mapper from the JSON representation?
>
> Brenden
>
>
