avro-user mailing list archives

From Nishanth S <chinchu2...@gmail.com>
Subject Re: Writing to multiple AvroSchemas in MapReduce
Date Thu, 25 Jun 2015 22:23:13 GMT
The Avro documentation here says it is possible but doesn't say how to
configure the AvroJob in the driver.

http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html
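
For reference, here is a minimal sketch of the pattern that page describes,
reduced to a map-only job with two of the four schemas. It is a guess at the
wiring, not a tested job. Assumptions: the record schemas, the named outputs
"schemaA" and "schemaB", the class names, and the "a/part" and "b/part"
sub-paths are all hypothetical, and the default text input stands in for the
real binary InputFormat. It needs the Avro and Hadoop jars on the classpath.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiSchemaDriver {
  // Hypothetical stand-ins for A.getClassSchema() and B.getClassSchema().
  static final Schema SCHEMA_A =
      SchemaBuilder.record("A").fields().requiredString("id").endRecord();
  static final Schema SCHEMA_B =
      SchemaBuilder.record("B").fields().requiredLong("offset").endRecord();

  public static class SplitMapper
      extends Mapper<LongWritable, Text, AvroKey<GenericRecord>, NullWritable> {
    private AvroMultipleOutputs amos;

    @Override
    protected void setup(Context context) {
      amos = new AvroMultipleOutputs(context);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Placeholder logic: emit one record per schema for every input line.
      GenericRecord a = new GenericData.Record(SCHEMA_A);
      a.put("id", line.toString());
      // The last argument is a base path relative to the job output directory.
      amos.write("schemaA", new AvroKey<GenericRecord>(a), NullWritable.get(), "a/part");

      GenericRecord b = new GenericData.Record(SCHEMA_B);
      b.put("offset", offset.get());
      amos.write("schemaB", new AvroKey<GenericRecord>(b), NullWritable.get(), "b/part");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      amos.close(); // flushes and closes every named output that was opened
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "split-into-avro");
    job.setJarByClass(MultiSchemaDriver.class);
    job.setMapperClass(SplitMapper.class);
    job.setNumReduceTasks(0); // map-only job

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // One named output per schema; the value schema is null because only keys are written.
    AvroMultipleOutputs.addNamedOutput(job, "schemaA", AvroKeyOutputFormat.class, SCHEMA_A, null);
    AvroMultipleOutputs.addNamedOutput(job, "schemaB", AvroKeyOutputFormat.class, SCHEMA_B, null);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

If this is right, schemas C and D would be two more addNamedOutput calls, and
no second AvroJob instance would be needed.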

-Nishanth

On Thu, Jun 25, 2015 at 4:10 PM, Sam Groth <sgroth@yahoo-inc.com> wrote:

> Looking at the example (http://avro.apache.org/docs/current/mr.html), I
> don't think it would be possible to configure multiple output schemas in
> one job. A JobConf can only set one writer schema with one output path (
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/JobConf.html).
> I believe it is required that all output data from a job has the same
> schema. I have not seen any use case where a map reduce job can have
> multiple output schemas.
>
>
> Sam
>
>
>
>
>   On Thursday, June 25, 2015 4:35 PM, Nishanth S <chinchu2884@gmail.com>
> wrote:
>
>
> Thank you, Sam. I am trying to read a single binary file in MapReduce and
> split it into 4 Avro files, each with a different schema. I am trying to do
> this in one job, but I am still not sure how to specify multiple output
> schemas on an AvroJob instance. Do we need to create multiple instances of
> AvroJob in the MapReduce driver to do this?
>
> Thanks,
> Nishan
>
> On Thu, Jun 25, 2015 at 2:53 PM, Sam Groth <sgroth@yahoo-inc.com> wrote:
>
> If you process 4 files with schemas A, B, C, and D as the writer schemas,
> then I would assume that you would want to specify the reader schema using
> the setInput*Schema methods. Then you can set the writer schema with the
> methods that you are calling. To be clear all data processed by the job
> should have one reader schema that is determined when the data is read, and
> there should also be one writer schema (possibly different from the reader
> schema) when the data is written back to files. If you need to process the
> data from each schema independently, you should probably create one job for
> each schema.
>
> Disclaimer: I have never used the AvroJob interface directly, so this is
> just me inferring what I think it should do based on my experience with
> AvroStorage and the other language-specific Avro interfaces.
>
> Hope this helps,
> Sam
>
>
>
>   On Thursday, June 25, 2015 12:53 PM, Nishanth S <chinchu2884@gmail.com>
> wrote:
>
>
>
> Hello All,
>
> We are using Avro 1.7.7 and Hadoop 2.5.1 in our project. We need to
> process a mixed-mode binary file using MapReduce and write the output as
> multiple Avro files, each with a different Avro schema. I looked at the
> AvroMultipleOutputs class but did not completely understand what needs to
> be done in the driver class. This is a map-only job whose output should be
> 4 different Avro files (each with a different Avro schema) written to
> different HDFS directories.
>
> Do we need to set all key and value Avro schemas on AvroJob in the driver
> class?
>
> AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.NULL));
> AvroJob.setOutputValueSchema(job, A.getClassSchema());
>
>
>
> Now if I have schemas B, C, and D, how would these be set on
> AvroJob? Thanks for your help.
>
>
> Thanks,
> Nishan
>
>
>
>
>
>
>
>
