avro-user mailing list archives

From Sam Groth <sgr...@yahoo-inc.com>
Subject Re: Writing to multiple AvroSchemas in MapReduce
Date Thu, 25 Jun 2015 20:53:01 GMT
If you process 4 files with schemas A, B, C, and D as the writer schemas, then I would assume
that you would want to specify the reader schema using the setInput*Schema methods. Then you
can set the writer schema with the methods that you are calling. To be clear, all data processed
by the job should have one reader schema that is determined when the data is read, and there
should also be one writer schema (possibly different from the reader schema) when the data
is written back to files. If you need to process the data from each schema independently,
you should probably create one job for each schema.
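
A minimal sketch of the one-job-per-schema approach, assuming Avro container files as input and the org.apache.avro.mapreduce API. The class name PerSchemaDriver, the buildJob helper, and the command-line layout are hypothetical and not from the original question; no mapper is set, so Hadoop's default identity mapper passes records through. Treat this as an untested outline rather than a verified driver:

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: runs one map-only job per schema file.
public class PerSchemaDriver {

    // Builds a map-only job with one reader schema and one writer schema.
    static Job buildJob(Configuration conf, Schema readerSchema, Schema writerSchema,
                        Path input, Path output) throws Exception {
        Job job = Job.getInstance(conf, "avro-" + writerSchema.getName());
        job.setJarByClass(PerSchemaDriver.class);
        job.setNumReduceTasks(0); // map-only

        // Reader side: the schema that records are resolved to when read.
        job.setInputFormatClass(AvroKeyInputFormat.class);
        AvroJob.setInputKeySchema(job, readerSchema);
        FileInputFormat.addInputPath(job, input);

        // Writer side: the single schema used for this job's output files.
        job.setOutputFormatClass(AvroKeyOutputFormat.class);
        AvroJob.setOutputKeySchema(job, writerSchema);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, output);
        return job;
    }

    // Assumed usage: PerSchemaDriver <inputDir> <outputBaseDir> <schema1.avsc> [<schema2.avsc> ...]
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        for (int i = 2; i < args.length; i++) {
            // A fresh parser per file avoids name clashes between schema files.
            Schema schema = new Schema.Parser().parse(new File(args[i]));
            // Each schema gets its own output directory under the base directory.
            Path output = new Path(args[1], schema.getName());
            // Here the same schema is used for reading and writing; they may differ.
            Job job = buildJob(conf, schema, schema, input, output);
            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }
        }
    }
}
```

Running the jobs one after another keeps each output directory tied to exactly one writer schema, at the cost of reading the input once per schema.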

Disclaimer: I have never used the AvroJob interface directly, so this is just me inferring
what I think it should do based on my experience with AvroStorage and the other language-specific
Avro interfaces.

Hope this helps,

     On Thursday, June 25, 2015 12:53 PM, Nishanth S <chinchu2884@gmail.com> wrote:


Hello All,
We are using Avro 1.7.7 and Hadoop 2.5.1 in our project. We need to process a mixed-mode
binary file using MapReduce and produce multiple Avro files as output, each with a different
Avro schema. I looked at the AvroMultipleOutputs class but did not completely understand
what needs to be done in the driver class. This is a map-only job whose output should be
4 different Avro files (each with a different Avro schema), written to different HDFS
directories.
Do we need to set all key and value Avro schemas on AvroJob in the driver class?
AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.NULL));
AvroJob.setOutputValueSchema(job,

Now if I have schemas B, C, and D, how would these be set on AvroJob? Thanks for your

