avro-user mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: Multiple input schemas in MapReduce?
Date Wed, 11 May 2011 23:46:27 GMT
Are the multiple schemas a series of schema evolutions?

That is, is there an obvious 'reader' schema, or are they disjoint?  If
this represents schema evolution, it should be possible (though a current
bug or limitation may prevent it) to set the reader schema to the most
recent schema and resolve all files as that schema.
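
To illustrate what resolving every file against one reader schema means,
here is a minimal sketch in plain Python. It does not use the Avro library;
it only models two of the resolution rules (fields are matched by name, and
a field missing from the written data is filled from the reader's default).
The field names, the NO_DEFAULT marker, and the resolve() helper are all
hypothetical.

```python
# Simplified model of Avro schema resolution (not the Avro library API).
# A schema is reduced to a {field_name: default} map; NO_DEFAULT marks a
# field that declares no default value.
NO_DEFAULT = object()


def resolve(record, reader_fields):
    """Project a record written with an older schema onto the reader schema."""
    resolved = {}
    for name, default in reader_fields.items():
        if name in record:
            # Field present in the written data: keep its value.
            resolved[name] = record[name]
        elif default is not NO_DEFAULT:
            # Field added later: fall back to the reader's default.
            resolved[name] = default
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    # Fields that exist only in the written data are dropped.
    return resolved


# Hypothetical most-recent schema: "email" was added with a default of None.
READER_FIELDS = {"id": NO_DEFAULT, "name": NO_DEFAULT, "email": None}

old_record = {"id": 1, "name": "a"}                             # older variant
new_record = {"id": 2, "name": "b", "email": "b@example.com"}   # newest variant

rows = [resolve(r, READER_FIELDS) for r in (old_record, new_record)]
```

Both records come out with the same three fields, so the map phase only
ever sees one shape of data.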

I currently run M/R jobs (not using Avro's mapreduce package -- it's a
custom Pig reader) over sets of Avro data files whose schema has evolved
over time -- at least two dozen variants.  The reader uses the most recent
version, and we have been careful to evolve the schema in a way that
maintains compatibility.
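
As a rough sketch of the kind of check that "maintains compatibility"
implies, the plain-Python snippet below flags reader fields that an older
variant lacks and that have no default. It is not our actual tooling and
not the Avro API; it ignores type promotion, aliases, and unions, and the
variant names and fields are hypothetical.

```python
# Simplified compatibility check (not the Avro library API).
NO_DEFAULT = object()  # marks a reader field that declares no default


def incompatible_fields(writer_field_names, reader_fields):
    """Return reader fields the writer lacks and that have no default."""
    return sorted(
        name
        for name, default in reader_fields.items()
        if name not in writer_field_names and default is NO_DEFAULT
    )


# Hypothetical most-recent schema, reduced to {field_name: default}.
reader_fields = {"id": NO_DEFAULT, "name": NO_DEFAULT, "email": None}

# Hypothetical older variants, reduced to the set of field names they wrote.
variants = {
    "v1": {"id", "name"},
    "v0": {"id"},
}

# Map each variant to the fields that would fail to resolve.
problems = {
    version: incompatible_fields(fields, reader_fields)
    for version, fields in variants.items()
}
```

An empty list for a variant means its files can be read with the most
recent schema under these simplified rules.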

On 5/11/11 11:44 AM, "Markus Weimer" <weimer@yahoo-inc.com> wrote:

>I'd like to write a mapreduce job that uses avro throughout, but the map
>phase would need to read files with two different schemas, similar to
>what the MultipleInputFormat does in stock hadoop. Is this a supported
>use case? 
>A work-around would be to create a union schema that has both fields as
>optional and to convert all data into it, but that seems clumsy.
>Has anyone done this before?
>Thanks for any suggestion you can give,
