avro-user mailing list archives

From Brad Smith <bnsm...@gmail.com>
Subject Best practices for reading union types (and an offer to write a patch)
Date Mon, 06 Oct 2014 17:45:16 GMT
I'm working on a project that involves reading in records that may be one
of a large number of different types. Here's a sample schema to give you
the basic idea:

protocol AvroTestProtocol {
    @namespace("com.company.project")
    record Type1Data {
        long pressure;
        double temperature;
    }

    @namespace("com.company.project")
    record Type2Data {
        long pressure;
        long sensor_type;
    }

    @namespace("com.company.project")
    record Type3Data {
        double speed;
    }

    @namespace("com.company.project")
    record AvroTestRecord {
        long general_info;
        union {Type1Data, Type2Data, Type3Data} specific_info;
    }
}

The example only has three different types, but in the real world there are
many more. Some of the types share data items (the 'pressure' element
appears in both 'Type1Data' and 'Type2Data'). Using an Avro 'union' seems
like a natural way of dealing with this kind of situation. I've used the
avro-tools 'idl2schemata' and 'compile schema' commands to generate Java
classes, and I'm using those to read in data with this format. I have some
working Scala code that reads the data, but I'm not sure if my method is
the *right* way to do this kind of thing:

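import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.specific.{SpecificDatumReader, SpecificRecordBase}
// Implicit Java-to-Scala collection conversions, needed for `.map` on the
// java.util.List returned by getFields (Scala 2.12 and earlier)
import scala.collection.JavaConversions._
import com.company.project.AvroTestRecord

object Main extends App {
    val file = new File("information.avro")
    val msgReader = new SpecificDatumReader[AvroTestRecord](classOf[AvroTestRecord])
    val fileReader = new DataFileReader[AvroTestRecord](file, msgReader)
    var m: AvroTestRecord = null

    while (fileReader.hasNext()) {
        // Passing the previous record lets the reader reuse the object
        m = fileReader.next(m)

        // This is the part that I'm uncertain about:
        val specific_msg = m.specific_info.asInstanceOf[SpecificRecordBase]
        val specific_msg_fields = specific_msg.getSchema.getFields.map(_.name)

        if (specific_msg_fields.contains("pressure")) {
            println("Must be Type1Data or Type2Data: " + specific_msg.get("pressure"))
        } else {
            println("Must be Type3Data!")
        }
    }
    fileReader.close()
}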

Is this the most elegant method for reading in 'union' data? If this is the
right way to do this, then I'd like to propose a small patch (that I would
be willing to write). Getting the list of fields from a
'SpecificRecordBase' object is a bit cumbersome. I'd like to add a
'getFieldNames' method that would allow me to change this line:

val specific_msg_fields = specific_msg.getSchema.getFields.map(_.name)

to this line:

val specific_msg_fields = specific_msg.getFieldNames

Does this seem like a worthwhile improvement to usability? Let me know if
you would like me to put together a patch or file an issue.
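
To make the idea concrete, here is a rough sketch of the behaviour I have in
mind, written as a Scala implicit class rather than as a change to Avro
itself. The names 'FieldNameSketch' and 'RecordOps' are placeholders I made
up for illustration, and 'getFieldNames' does not exist in Avro today; an
actual patch would presumably add the method to 'SpecificRecordBase' in Java.

```scala
import org.apache.avro.generic.IndexedRecord
import scala.collection.JavaConverters._

object FieldNameSketch {
  // Hypothetical helper, not part of Avro: adds getFieldNames to any
  // IndexedRecord (SpecificRecordBase implements that interface).
  implicit class RecordOps(val record: IndexedRecord) extends AnyVal {
    // Field names in schema declaration order.
    def getFieldNames: List[String] =
      record.getSchema.getFields.asScala.map(_.name).toList
  }
}
```

With 'import FieldNameSketch._' in scope, the line from my reader becomes
'val specific_msg_fields = specific_msg.getFieldNames'.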

Thanks!
