avro-user mailing list archives

From Ryon Day <ryon_...@yahoo.com>
Subject Utf8 to String ClassCastException upon deserialization, setting properties on Union schema
Date Fri, 28 Aug 2015 18:52:44 GMT
Thanks in advance for any help anyone can offer with this issue. Forgive the scattered presentation;
I am new to Avro and unsure how to phrase the exact problem I'm experiencing.
I have been using Java to read Avro messages off a Kafka topic in a heterogeneous
environment (the messages are put onto the topic by a Go client written by a different
team; I have no control over what is put onto the wire or how).
My team controls the schema for the project; we have a .avsc schema file and use the "com.commercehub.gradle.plugin.avro"
Gradle plug-in to generate our Java classes. Currently, I can deserialize messages into our
auto-generated classes effectively. However, the Utf8 data type used in lieu of String is
a constant vexation. We use String quite often, CharSequence rarely, and Utf8 never.
Hence, our code is cluttered with constant toString() calls on Utf8 objects, which make navigating
and understanding the source very cumbersome.
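To make the clutter concrete, here is a minimal, self-contained illustration (the value is made up) of why the toString() calls pile up everywhere we touch a generated field:

```java
import org.apache.avro.util.Utf8;

public class Utf8Clutter {
    public static void main(String[] args) {
        // What the generated classes hand back at runtime:
        CharSequence name = new Utf8("some value");

        // name is NOT a java.lang.String, so this comparison is false:
        boolean direct = "some value".equals(name);

        // We have to convert at every use site that actually needs a String:
        String usable = name.toString();

        System.out.println(direct + " " + usable.equals("some value")); // prints: false true
    }
}
```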
The Gradle plugin in question has a "stringType" option which, when set to String, does indeed
generate the domain classes with standard Strings instead of Utf8 objects. But when I
attempt to deserialize generic messages off the topic, I receive the following stack trace
(we are using the Avro 1.7.7 Java library):
java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be cast to java.lang.String
    at com.employer.avro.schema.DomainObject.put(DomainObject.java:64)
    at org.apache.avro.generic.GenericData.setField(GenericData.java:573)
    at org.apache.avro.generic.GenericData.setField(GenericData.java:590)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)

I am using a method similar to the following to read byte arrays that come off the topic:

    byte[] message = readTheQueueHere(...);
    SpecificDatumReader<SomeType> specificDatumReader = new SpecificDatumReader<>(ourSchema);
    BinaryDecoder someTypeDecoder = DecoderFactory.get().binaryDecoder(message, null);
    SomeType record = specificDatumReader.read(null, someTypeDecoder); // <-- Exception thrown here
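For comparison, here is a stripped-down round trip in the generic API where the reader is given both the writer schema and a reader schema whose string type carries "avro.java.string" (the record name "R" and field name "f" are made up for this sketch); in this isolated form the field does appear to come back as a String rather than a Utf8:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.*;
import org.apache.avro.io.*;

public class TwoSchemaRead {
    // Encodes with the writer schema, decodes with writer + reader schemas,
    // and returns the runtime class of the decoded string field.
    static String roundTripFieldClass() throws Exception {
        // Writer schema: a plain string field, as (I assume) the Go side writes it.
        Schema writer = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"R\",\"fields\":"
          + "[{\"name\":\"f\",\"type\":\"string\"}]}");
        // Reader schema: same field, but the string TYPE carries the property.
        Schema reader = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"R\",\"fields\":"
          + "[{\"name\":\"f\",\"type\":{\"type\":\"string\","
          + "\"avro.java.string\":\"String\"}}]}");

        // Encode a record with the writer schema...
        GenericRecord rec = new GenericData.Record(writer);
        rec.put("f", "hello");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
        enc.flush();

        // ...and decode with BOTH schemas so Avro resolves between them.
        GenericDatumReader<GenericRecord> r = new GenericDatumReader<>(writer, reader);
        BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord decoded = r.read(null, dec);
        return decoded.get("f").getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTripFieldClass());
    }
}
```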
Other things I have tried:

I tried setting the "avro.java.string": "String" property on all String fields within the
schema. This did nothing; the problem still occurs. In fact, the exception above
occurs on a field that is explicitly marked with "avro.java.string": "String" in the
schema file.

Java:

    public void put(int field$, java.lang.Object value$) {
        ...
        case 3: some_field = (java.lang.String)value$;
        ...
    }

Schema file:

    {
        "name": "some_field",
        "type": "string",
        "avro.java.string": "String"
    }

In fact, when I toString() the schema, it shows up there too:

    {"name":"some_field","type":{"type":"string","avro.java.string":"String"},"avro.java.string":"String"}
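From poking at the parsed schema, it looks like it matters where the property lands: when it sits beside a plain "type": "string" it attaches to the field, not to the string type itself, and only the string type's copy is visible on the field's schema. A minimal sketch (record/field names made up) checking where the parsed property ends up:

```java
import org.apache.avro.Schema;

public class StringPropCheck {
    // Returns the value of "avro.java.string" on the TYPE of "some_field", or null.
    static String propOnStringType(String schemaJson) {
        Schema record = new Schema.Parser().parse(schemaJson);
        Schema fieldType = record.getField("some_field").schema();
        return fieldType.getProp("avro.java.string");
    }

    public static void main(String[] args) {
        // Property nested inside the type object: the string schema carries it.
        String nested = "{\"type\":\"record\",\"name\":\"DomainObject\",\"fields\":"
            + "[{\"name\":\"some_field\",\"type\":{\"type\":\"string\","
            + "\"avro.java.string\":\"String\"}}]}";
        System.out.println(propOnStringType(nested)); // "String"

        // Property as a sibling of "type": "string": it becomes a FIELD property,
        // so the string type itself does not carry it.
        String sibling = "{\"type\":\"record\",\"name\":\"DomainObject\",\"fields\":"
            + "[{\"name\":\"some_field\",\"type\":\"string\","
            + "\"avro.java.string\":\"String\"}]}";
        System.out.println(propOnStringType(sibling)); // null
    }
}
```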
I tried using GenericData.setStringType(DomainObject.getClassSchema(), StringType.String);
This also did nothing (but see below).

I tried setting the String type on the schema as a whole:

    Schema schema = new Schema.Parser().parse(this.getClass().getResourceAsStream("/ourSchema.avsc"));
    GenericData.setStringType(schema, StringType.String);

This resulted in the following exception:

    org.apache.avro.AvroRuntimeException: Can't set properties on a union: [Our Schema as a String]

I suppose this is because of some subtlety in our schema arising from our inexperience with
the platform; from what little poking I've done, our entire schema (with several record
types in it) is defined as a union, a construct I've found very little explanation of.
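As far as I can tell, a schema file whose top level is a JSON array of record definitions parses as a UNION schema, and Avro refuses to set properties on a union, which would explain the exception. A minimal sketch reproducing this (record names "A"/"B" are made up); the individual record schemas are still reachable through getTypes():

```java
import org.apache.avro.AvroRuntimeException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericData.StringType;

public class UnionProbe {
    public static void main(String[] args) {
        // A top-level JSON array of records parses as a UNION schema:
        Schema union = new Schema.Parser().parse(
            "[{\"type\":\"record\",\"name\":\"A\",\"fields\":"
          + "[{\"name\":\"f\",\"type\":\"string\"}]},"
          + "{\"type\":\"record\",\"name\":\"B\",\"fields\":"
          + "[{\"name\":\"g\",\"type\":\"string\"}]}]");

        System.out.println(union.getType());         // a UNION schema
        System.out.println(union.getTypes().size()); // two record branches

        try {
            // This is exactly what blows up: properties can't go on a union.
            GenericData.setStringType(union, StringType.String);
        } catch (AvroRuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```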
I worked up a small sample method to reproduce what I think is happening. I believe
the Go end of things is writing generic records to Kafka (I don't even know if this makes
a difference).
    public static SomeType encodeToSpecificRecord(GenericRecord someRecord, Schema writerSchema,
            Schema readerSchema) throws Exception {
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(writerSchema);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        datumWriter.write(someRecord, encoder);
        encoder.flush();
        byte[] inBytes = out.toByteArray();
        out.close();

        SpecificDatumReader<SomeType> reader = new SpecificDatumReader<>(readerSchema);
        Decoder decoder = DecoderFactory.get().binaryDecoder(inBytes, null);
        SomeType result = reader.read(null, decoder);
        logger.debug("Read record {}", result.toString());
        return result;
    }
The only way I can get this to work is if I do the following:

    GenericData.setStringType(SomeType.getClassSchema(), StringType.String);

and call the method with SomeType.getClassSchema() as both the reader and writer schema.
This will not do, because I am pretty sure this is not how the Go end of things is writing
the data; I believe they are using the overarching schema as read from the file. If I
mismatch reader and writer schemas, I get the following:
java.lang.ArrayIndexOutOfBoundsException: 9
    at org.apache.avro.io.ResolvingDecoder.readEnum(ResolvingDecoder.java:257)
Can anyone offer any assistance in untangling this problem? Constant String <-> Utf8
conversion clutters the code, reduces efficiency, and hurts readability and maintainability.
Thank you very much!
