avro-user mailing list archives

From Fengyun RAO <raofeng...@gmail.com>
Subject Re: MapReduce: Using Avro Input/Output Formats without Specifying a schema
Date Wed, 30 Apr 2014 06:41:22 GMT
Take MapReduce, for example: a job consists of a Runner, a Mapper, and a Reducer.

The Mapper must output a single type (that is, a single Avro schema).

If you have a set of CSV files with different schemas, what output type
would you expect?

If all the CSV files share the same schema, you could create the schema
dynamically in the Runner before submitting the MR job.
If you look into Schema.java, you will find create(), createRecord(),
and similar APIs.
You could simply read the header of one CSV file and create the schema using
these APIs. For example,
    AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.STRING));
sets the map output key schema to a plain string schema (a primitive string
type, not a record).
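
As a rough sketch of the "read one CSV header, build the schema in the Runner" idea, the snippet below turns a header line into Avro record schema JSON using only the JDK. The class name CsvHeaderToAvroSchema, the helper names, and the choice to type every column as "string" are assumptions made for illustration, not something from Avro itself. In a real Runner the resulting JSON could be handed to new Schema.Parser().parse(json), or the same fields could be built with the Schema.java APIs mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvHeaderToAvroSchema {

    // Avro names must match [A-Za-z_][A-Za-z0-9_]*, so any other character
    // is replaced with an underscore (hypothetical helper).
    static String sanitize(String raw) {
        String name = raw.trim().replaceAll("[^A-Za-z0-9_]", "_");
        if (name.isEmpty() || Character.isDigit(name.charAt(0))) {
            name = "_" + name;
        }
        return name;
    }

    // Builds record schema JSON with one string field per CSV column.
    // Assumption: a simple comma split is enough, i.e. the header has no
    // quoted column names containing commas.
    static String schemaJsonFromHeader(String recordName, String headerLine) {
        List<String> fields = new ArrayList<>();
        for (String column : headerLine.split(",")) {
            fields.add("{\"name\":\"" + sanitize(column) + "\",\"type\":\"string\"}");
        }
        return "{\"type\":\"record\",\"name\":\"" + sanitize(recordName)
                + "\",\"fields\":[" + String.join(",", fields) + "]}";
    }

    public static void main(String[] args) {
        // Example header; a real Runner would read the first line of a CSV file.
        System.out.println(schemaJsonFromHeader("CsvRow", "id,first name,2nd_score"));
    }
}
```

This only covers the case where every CSV file shares one header. With a different schema per file, each group of files would still need its own schema (and likely its own job or output).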

2014-04-30 4:56 GMT+08:00 Ryan Tabora <ratabora@gmail.com>:

> Hi all,
> Whether you’re using Hive or MapReduce, avro input/output formats require
> you to specify a schema at the beginning of the job or the table definition
> in order to work with them. Is there any way to configure the jobs in a way
> that the input/output formats can dynamically determine the schema from the
> data itself?
> Think about a job like this. I have a set of CSV files that I want to
> serialize into avro files. These CSV files are self describing and each CSV
> file has a unique schema. If I want to write a job that scans over all of
> this data and serialize it into avro I can’t do that with today’s tools (as
> far as I know). If I can’t specify the schema up front, what can I do? Am I
> forced to write my own avro input/output formats?
> The avro schema is stored within the avro data file itself, why can’t
> these input/output formats be smart enough to figure that out? Am I
> fundamentally doing something against the principles of the avro format? I
> would be surprised if no one has run into this issue before.
> Regards,
> Ryan Tabora
