avro-user mailing list archives

From Scott Carey <scottca...@apache.org>
Subject Re: Collecting union-ed Records in AvroReducer
Date Thu, 08 Dec 2011 17:45:38 GMT

On 12/8/11 4:10 AM, "Andrew Kenworthy" <adwkenworthy@yahoo.com> wrote:

>is it possible to write/collect a union-ed record from an avro reducer?
>I have a reduce class (extending AvroReducer), and the output schema is a
>union schema of record type A and record type B. In the reduce logic I
>want to combine instances of A and B in the same datum, passing it to my
>AvroCollector. My code looks a bit like this:

If both records were created in the reducer, you can call collect twice,
once with each record.  Collect in general can be called as many times as
you wish.

If you want to combine two records into a single datum rather than emit
multiple datums, you do not want a union; you need a Record.  A union
datum holds exactly one of the union's branches at a time, never several.

In short, do you want to emit both records individually or as a pair?  If
it is a pair, you need a Record; if it is multiple outputs or either/or,
it is a Union.
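
To make the either/or nature of a union concrete, here is a minimal sketch
using Avro's generic API. The schemas RecordA and RecordB, their fields "id"
and "count", and the class name UnionBranchDemo are assumptions made up for
illustration; they are not taken from the original poster's code.

```java
import java.util.Arrays;
import org.apache.avro.AvroRuntimeException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

class UnionBranchDemo {
  // Hypothetical record schemas standing in for the poster's types A and B.
  static final Schema RECORD_A = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"RecordA\","
      + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");
  static final Schema RECORD_B = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"RecordB\","
      + "\"fields\":[{\"name\":\"count\",\"type\":\"long\"}]}");

  // The union ["RecordA", "RecordB"]: a datum is one branch or the other.
  static final Schema UNION =
      Schema.createUnion(Arrays.asList(RECORD_A, RECORD_B));

  /** Returns the index of the union branch that the given datum matches. */
  static int branchOf(Object datum) {
    return GenericData.get().resolveUnion(UNION, datum);
  }

  /** Returns true if a GenericData.Record can be constructed on the schema. */
  static boolean canBuildRecordOn(Schema schema) {
    try {
      new GenericData.Record(schema);
      return true;
    } catch (AvroRuntimeException e) {
      // GenericData.Record rejects any schema that is not a record schema.
      return false;
    }
  }

  public static void main(String[] args) {
    GenericRecord a = new GenericData.Record(RECORD_A);
    a.put("id", "x");
    GenericRecord b = new GenericData.Record(RECORD_B);
    b.put("count", 1L);

    System.out.println("branch of a: " + branchOf(a));
    System.out.println("branch of b: " + branchOf(b));
    System.out.println("record on union schema: " + canBuildRecordOn(UNION));
  }
}
```

Each record resolves to its own branch of the union, and there is no datum
that occupies both branches at once, which is why the union schema cannot
be handed to the GenericData.Record constructor.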

>Record unionRecord = new GenericData.Record(myUnionSchema); // not legal!
>unionRecord.put("type A", recordA);
>unionRecord.put("type B", recordB);
>but GenericData.Record constructor expects a Record Schema. How can I
>write both records such that they appear in the same output
> datum?

If your output is either one type or another, see Doug's answer.

For multiple datums, the output schema is a union of two records (a datum
is either one or the other):
["RecordA", "RecordB"]
and the code calls collect once for each record.
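
A minimal sketch of such a reducer follows. It is not the listing from the
original message; the class name UnionReducer, the key and value types, and
the fields "id" and "count" are assumptions chosen for illustration. The job
itself would be configured with the union ["RecordA", "RecordB"] as its
output schema.

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroReducer;
import org.apache.hadoop.mapred.Reporter;

class UnionReducer
    extends AvroReducer<CharSequence, GenericRecord, GenericRecord> {

  // Hypothetical record schemas; both are branches of the job's output union.
  static final Schema RECORD_A = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"RecordA\","
      + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");
  static final Schema RECORD_B = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"RecordB\","
      + "\"fields\":[{\"name\":\"count\",\"type\":\"long\"}]}");

  @Override
  public void reduce(CharSequence key, Iterable<GenericRecord> values,
                     AvroCollector<GenericRecord> collector,
                     Reporter reporter) throws IOException {
    // Count the grouped values so RecordB has something to carry.
    long n = 0;
    for (GenericRecord ignored : values) {
      n++;
    }

    GenericRecord a = new GenericData.Record(RECORD_A);
    a.put("id", key.toString());

    GenericRecord b = new GenericData.Record(RECORD_B);
    b.put("count", n);

    // Two separate output datums, each matching one branch of the union.
    collector.collect(a);
    collector.collect(b);
  }
}
```

The two records end up as independent entries in the output; nothing ties
them together beyond their order of emission.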


If you want a single datum that contains both a RecordA and a RecordB you
need to have your output schema be a Record with two fields (Avro requires
every record schema to have a name; "Pair" here is only a placeholder):

{"type":"record", "name":"Pair", "fields":[
  {"name":"recordA", "type":"RecordA"},
  {"name":"recordB", "type":"RecordB"}
]}

And you would use this record schema to create the GenericRecord, and then
populate the fields with the inner records, then call collect once with
the outer record.
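
The steps just described can be sketched as follows. The outer record name
"Pair", the inner fields "id" and "count", and the class name PairRecordDemo
are assumptions for illustration. The inner schemas are parsed first with
the same Schema.Parser so that the outer schema can refer to them by name.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

class PairRecordDemo {
  static final Schema RECORD_A;
  static final Schema RECORD_B;
  static final Schema PAIR;

  static {
    // One parser instance remembers named types across parse() calls.
    Schema.Parser parser = new Schema.Parser();
    RECORD_A = parser.parse(
        "{\"type\":\"record\",\"name\":\"RecordA\","
        + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");
    RECORD_B = parser.parse(
        "{\"type\":\"record\",\"name\":\"RecordB\","
        + "\"fields\":[{\"name\":\"count\",\"type\":\"long\"}]}");
    // Hypothetical outer record holding one RecordA and one RecordB.
    PAIR = parser.parse(
        "{\"type\":\"record\",\"name\":\"Pair\",\"fields\":["
        + "{\"name\":\"recordA\",\"type\":\"RecordA\"},"
        + "{\"name\":\"recordB\",\"type\":\"RecordB\"}]}");
  }

  /** Wraps the two inner records in a single outer datum. */
  static GenericRecord pairOf(GenericRecord a, GenericRecord b) {
    GenericRecord pair = new GenericData.Record(PAIR);
    pair.put("recordA", a);
    pair.put("recordB", b);
    return pair;
  }

  public static void main(String[] args) {
    GenericRecord a = new GenericData.Record(RECORD_A);
    a.put("id", "x");
    GenericRecord b = new GenericData.Record(RECORD_B);
    b.put("count", 1L);

    GenericRecord pair = pairOf(a, b);
    // In a reducer this would be the single call: collector.collect(pair);
    System.out.println(GenericData.get().validate(PAIR, pair));
  }
}
```

With this layout the reducer's output schema is the outer record schema, not
a union, and each output datum always carries both inner records.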

Another choice is to have the output be an Avro array of the union type,
which may hold any number of RecordA and RecordB instances in a single datum.
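
As a rough sketch of that last option, the array schema can be built from
the union and filled with any mix of the two record types. As before, the
inner schemas, their fields, and the names ArrayOfUnionDemo and bundle are
assumptions for illustration only.

```java
import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

class ArrayOfUnionDemo {
  // Hypothetical record schemas standing in for the poster's types A and B.
  static final Schema RECORD_A = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"RecordA\","
      + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");
  static final Schema RECORD_B = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"RecordB\","
      + "\"fields\":[{\"name\":\"count\",\"type\":\"long\"}]}");

  // Equivalent to {"type":"array", "items":["RecordA", "RecordB"]}.
  static final Schema ARRAY = Schema.createArray(
      Schema.createUnion(Arrays.asList(RECORD_A, RECORD_B)));

  /** Packs any number of RecordA and RecordB instances into one datum. */
  static GenericData.Array<GenericRecord> bundle(GenericRecord... records) {
    GenericData.Array<GenericRecord> array =
        new GenericData.Array<GenericRecord>(records.length, ARRAY);
    for (GenericRecord r : records) {
      array.add(r);
    }
    return array;
  }

  public static void main(String[] args) {
    GenericRecord a = new GenericData.Record(RECORD_A);
    a.put("id", "x");
    GenericRecord b = new GenericData.Record(RECORD_B);
    b.put("count", 1L);

    GenericData.Array<GenericRecord> datum = bundle(a, b, a);
    // In a reducer this would be the single call: collector.collect(datum);
    System.out.println(GenericData.get().validate(ARRAY, datum));
  }
}
```

Compared with the two-field record, the array does not fix how many records
of each type a datum holds, so readers have to check each element's schema.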

