avro-user mailing list archives

From Brenden Brown <brenden.br...@placed.com>
Subject Differing behavior of Java and Python implementations
Date Tue, 14 Feb 2017 23:09:17 GMT
We're exploring replacing JSON with Avro as our data storage format.
Our schema is fairly messy, deeply nested, and has several nullable fields.
I'm writing some code to run a MapReduce step to convert from JSON to Avro
format.

I managed to get a working prototype on a simple schema. Now that I'm trying
the real schema, I'm running into a case where a Python prototype converts a
record successfully, while the Java prototype throws
org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_STRING.

Java code:

public void driver() throws Exception {
    byte[] encoded = Files.readAllBytes(Paths.get("json_file"));
    String string =  new String(encoded, StandardCharsets.UTF_8);
    Cluster c = deserializer(string, Cluster.getClassSchema());
}

public Cluster deserializer(String value, Schema schema) throws IOException
{
    InputStream stream = IOUtils.toInputStream(value);
    SpecificDatumReader<Cluster> reader = new SpecificDatumReader<>(schema);
    JsonDecoder decoder = DecoderFactory.get().jsonDecoder(schema, stream);
    return reader.read(null, decoder);
}

Python code:

import sys
import json
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

schema = avro.schema.parse(open('Cluster.avsc', "rb").read())
rec_writer = DatumWriter(schema)
df_writer  = DataFileWriter(open("users.avro", "wb"), rec_writer, schema)

for line in sys.stdin:
  cluster_dict = json.loads(line)
  df_writer.append(cluster_dict)
df_writer.close()

The input json is untagged. Here's a representative subset:
{
  "start_time": 1486000000000,
  "total_place_count": 0,
  "appliedDemoMdl": false,
  "longitude": -99.990911,
  "significant_place": null,
  "version": "inferencecore-1.123.0",
  "end_time": 1486070000000,
  "latitude": 11.111182
}
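
To make "untagged" concrete: in Avro's JSON encoding, a non-null union value
is wrapped in a single-key object named after the branch type, for example
{"string": "inferencecore-1.123.0"}, while null stays a bare null. The sample
above has no such wrappers. Below is a minimal sketch of what a tagged version
would look like. The schema fragment is hypothetical (the real Cluster.avsc is
not shown here, and which fields are nullable is an assumption), and
tag_unions is an illustrative helper, not part of the avro library. It only
handles the few types the sketch needs.

```python
import json

# Hypothetical schema fragment; the real Cluster.avsc is not shown in this
# message, and the choice of nullable fields here is an assumption.
SCHEMA = {
    "type": "record",
    "name": "Cluster",
    "fields": [
        {"name": "start_time", "type": "long"},
        {"name": "version", "type": ["null", "string"]},
        {"name": "significant_place", "type": ["null", "string"]},
    ],
}

# Rough Python-side checks for a handful of Avro primitive types.
PRIMITIVE_CHECKS = {
    "null": lambda v: v is None,
    "boolean": lambda v: isinstance(v, bool),
    "long": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "double": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}


def matches(value, schema):
    """Return True if value plausibly belongs to the given schema branch."""
    if isinstance(schema, str):
        return PRIMITIVE_CHECKS[schema](value)
    if schema["type"] == "record":
        return isinstance(value, dict)
    return matches(value, schema["type"])


def branch_name(schema):
    """Name used as the union tag: the type name, or the record's name."""
    if isinstance(schema, str):
        return schema
    return schema.get("name", schema["type"])


def tag_unions(value, schema):
    """Rewrite plain (untagged) JSON data into the tagged union form."""
    if isinstance(schema, list):  # a union: pick the first matching branch
        for branch in schema:
            if matches(value, branch):
                if branch == "null":
                    return None  # null is never wrapped
                return {branch_name(branch): tag_unions(value, branch)}
        raise ValueError("no union branch matches %r" % (value,))
    if isinstance(schema, dict) and schema["type"] == "record":
        return {
            field["name"]: tag_unions(value.get(field["name"]), field["type"])
            for field in schema["fields"]
        }
    return value  # primitives pass through unchanged


untagged = {
    "start_time": 1486000000000,
    "version": "inferencecore-1.123.0",
    "significant_place": None,
}
tagged = tag_unions(untagged, SCHEMA)
print(json.dumps(tagged, indent=2))
```

Under those assumptions, "version" comes out wrapped as
{"string": "inferencecore-1.123.0"} and "significant_place" stays null.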

My main question: why do my two implementations behave differently? Is what
I'm trying to do not really possible without writing my own object mapper
from the JSON representation?

Brenden
