avro-user mailing list archives

From Ryan Tabora <ratab...@gmail.com>
Subject Re: MapReduce: Using Avro Input/Output Formats without Specifying a schema
Date Thu, 01 May 2014 04:49:19 GMT
Wow not sure how I missed this, thank you! :)

Regards,
Ryan Tabora
http://ryantabora.com


On Wed, Apr 30, 2014 at 9:41 PM, Fengyun RAO <raofengyun@gmail.com> wrote:

> We also used AvroMultipleOutputs to deal with multiple schemas.
>
> The problem is the same: you have to set a single mapper output
> type (or schema) before submitting the MR job. Since there are
> multiple schemas, we used Schema.createUnion(List<Schema> types) as
> the mapper output schema.
>
> You could write a method that generates the list of schemas from the
> input data before submitting the MR job.
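[Editor's note: the union-schema idea above could be sketched as follows. The actual job would be Java (passing the parsed schemas to Schema.createUnion and AvroJob.setMapOutputValueSchema), but since Avro schemas are JSON documents and a union is simply a JSON array of schemas, the idea can be illustrated with stdlib-only Python. The file names and field names below are hypothetical, and every column is typed as string for simplicity.]

```python
import csv
import io
import json

def record_schema_from_csv(name, csv_text):
    """Build an Avro record-schema dict from a CSV header row.
    All fields are typed as "string" for simplicity (assumption)."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return {
        "type": "record",
        "name": name,
        "fields": [{"name": col, "type": "string"} for col in header],
    }

# Hypothetical inputs: two CSV files with different headers.
files = {
    "Users": "id,name,email\n1,alice,a@example.com\n",
    "Orders": "order_id,amount\n7,19.99\n",
}

# In Avro's JSON representation, a union is just a JSON array of schemas;
# this is what Schema.createUnion produces on the Java side.
union_schema = [record_schema_from_csv(n, t) for n, t in files.items()]
print(json.dumps(union_schema, indent=2))
```

The resulting JSON could be fed to Avro's Schema.Parser in the Runner, then set as the single mapper output schema before job submission.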
>
> 2014-04-30 21:46 GMT+08:00, Ryan Tabora <ratabora@gmail.com>:
> > Thanks Rao, I understand how I could do it if I had a single schema
> > across all input data. However, my question is what to do if my input
> > data varies and one input could have a different schema from another.
> >
> > My idea would be to use something like MultipleOutputs or partitioning
> > to split up the output data by unique schema.
> >
> > I guess the question still stands: does anyone have any recommendations
> > for dynamically generating the schema using Avro output formats?
> >
> > Thanks,
> > Ryan Tabora
> > http://ryantabora.com
> >
> > On April 29, 2014 at 11:41:51 PM, Fengyun RAO (raofengyun@gmail.com) wrote:
> >
> > Take MapReduce for example, which requires a Runner, a Mapper, and a Reducer.
> >
> > The Mapper requires outputting a single type (or a single Avro schema).
> >
> > If you have a set of CSV files with different schemas, what output type
> > would you expect?
> >
> > If all the CSV files share the same schema, you could dynamically create
> > the schema in the Runner before submitting the MR job.
> > If you look into Schema.java, you will find the create(), createRecord(),
> > etc. APIs.
> > You could simply read one CSV file's header and create the schema using
> > these APIs.
> > e.g.
> >     AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.STRING));
> > creates a schema with only a String field.
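[Editor's note: the "read one CSV header, build a record schema" step above could be sketched like this. The thread's APIs (Schema.createRecord, AvroJob) are Java, but the schema itself is JSON, so a stdlib-only Python sketch shows the shape of what would be built. The type-inference rule and the sample data are assumptions for illustration.]

```python
import csv
import io
import json

def infer_type(value):
    """Crude Avro type inference from one sample value (assumption:
    parseable int -> "long", parseable float -> "double", else "string")."""
    try:
        int(value)
        return "long"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"

def schema_from_csv_head(name, csv_text):
    """Read the header plus one sample row of a CSV and build an Avro
    record schema, analogous to Schema.createRecord in the Java API."""
    rows = csv.reader(io.StringIO(csv_text))
    header, sample = next(rows), next(rows)
    return {
        "type": "record",
        "name": name,
        "fields": [{"name": h, "type": infer_type(v)}
                   for h, v in zip(header, sample)],
    }

# Hypothetical CSV content: one header line and one data line.
schema = schema_from_csv_head("Measurement", "station,temp,count\nTSH-01,21.5,3\n")
print(json.dumps(schema))
```

In the Runner, the Java equivalent would parse this JSON with Schema.Parser (or build it directly with Schema.createRecord) and pass it to AvroJob before submitting the job.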
> >
> >
> >
> > 2014-04-30 4:56 GMT+08:00 Ryan Tabora <ratabora@gmail.com>:
> > Hi all,
> >
> > Whether you're using Hive or MapReduce, Avro input/output formats require
> > you to specify a schema at the beginning of the job, or in the table
> > definition, in order to work with them. Is there any way to configure the
> > jobs so that the input/output formats can dynamically determine the
> > schema from the data itself?
> >
> > Think about a job like this. I have a set of CSV files that I want to
> > serialize into Avro files. These CSV files are self-describing, and each
> > CSV file has a unique schema. If I want to write a job that scans over
> > all of this data and serializes it into Avro, I can't do that with
> > today's tools (as far as I know). If I can't specify the schema up
> > front, what can I do? Am I forced to write my own Avro input/output
> > formats?
> >
> > The Avro schema is stored within the Avro data file itself, so why can't
> > these input/output formats be smart enough to figure that out? Am I
> > fundamentally doing something against the principles of the Avro format?
> > I would be surprised if no one has run into this issue before.
> >
> > Regards,
> > Ryan Tabora
> >
> >
>
>
> --
> ----------------------------------------------------------------
> RAO Fengyun
> Center for Astrophysics, Tsinghua University
> Tel: +86 13810626496
> Email: raofengyun@gmail.com
>           rfy02@mails.tsinghua.edu.cn
> -----------------------------------------------------------------
>
