pig-dev mailing list archives

From Russell Jurney <russell.jur...@gmail.com>
Subject Fwd: Globbing several AVRO files with different (extended) schemes
Date Tue, 20 Mar 2012 21:54:35 GMT
Anyone interested in doing this?

---------- Forwarded message ----------
From: Scott Carey <scottcarey@apache.org>
Date: Tue, Mar 20, 2012 at 2:08 PM
Subject: Re: Globbing several AVRO files with different (extended) schemes
To: user@avro.apache.org

I'm assuming you are using Pig's AvroStorage function. It appears that it
does not support schema migration, but it certainly could do so.  A
collection of avro files can be 'viewed' as if they all are of one schema
provided they can all resolve to it.  I have several tools that do this
successfully with MapReduce/Pig/Hive.

The Pig AvroStorage tool is maintained by the Apache Pig project, so you will
need to ask there for more details.
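
To illustrate what "resolving to one schema" means, here is a minimal sketch
in plain Python. It is a simplified model of the idea only: it does not use
the Avro library or Pig's AvroStorage, and the function name, the field names
("id", "campaign", "extra") and the sample records are all hypothetical.
Records are assumed to be already decoded into dicts. Fields the reader
schema does not know are dropped, and fields the record lacks are filled from
the reader's default.

```python
# Simplified model of resolving records to one reader schema.
# NOT the Avro library; all names and data here are hypothetical.

_NO_DEFAULT = object()  # marks a reader field that declares no default


def resolve_record(record, reader_fields):
    """Project one decoded record onto the reader schema's fields.

    reader_fields is a list of (name, default) pairs; pass _NO_DEFAULT as
    the default when the reader field has none.
    """
    resolved = {}
    for name, default in reader_fields:
        if name in record:
            # Field was written: keep the written value.
            resolved[name] = record[name]
        elif default is not _NO_DEFAULT:
            # Field was not written: fall back to the reader's default.
            resolved[name] = default
        else:
            raise ValueError(f"field {name!r} missing and has no default")
    return resolved


# Reader schema: "id" is required, "campaign" defaults to null (None).
reader_fields = [("id", _NO_DEFAULT), ("campaign", None)]

old_file = [{"id": 1}, {"id": 2}]                         # older schema
new_file = [{"id": 3, "campaign": "spring", "extra": 1}]  # extended schema

rows = [resolve_record(r, reader_fields) for r in old_file + new_file]
```

With this model both files can be read as if they shared the reader schema,
provided every field missing from an older file has a default.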


On 3/20/12 2:27 AM, "Markus Resch" <markus.resch@adtech.de> wrote:

>Hi guys,
>Thanks again for your awesome hint about sqoop.
>I have another question: the data I'm working with is stored as Avro
>files in Hadoop. When I glob them, everything works perfectly. But
>when I extend the schema of a single data file while the others remain
>unchanged, everything breaks:
>"currently we assume all avro files under the same "location"
>     * share the same schema and will throw exception if not."
>(e.g. I add a new data field.) The behavior I would expect is: if I'm
>globbing several files with slightly different schemas, the result of the
>LOAD is either the intersection of the fields that are common to all
>schemas, or the atoms of the missing fields are set to null.
>How could I handle this properly?
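
The two behaviors asked for in the quoted message can be sketched as follows.
This is only an illustration in plain Python, not what AvroStorage does: each
schema is modeled as a list of field names, and the helper names and the
sample fields ("id", "ts", "campaign") are hypothetical.

```python
# Sketch of the two proposed behaviors for files with differing schemas.
# Schemas are modeled as lists of field names; all names are hypothetical.

def common_fields(schemas):
    """Fields present in every schema, in the first schema's order."""
    shared = set(schemas[0]).intersection(*schemas[1:])
    return [name for name in schemas[0] if name in shared]


def all_fields(schemas):
    """Every field seen in any schema, in first-seen order."""
    seen = []
    for schema in schemas:
        for name in schema:
            if name not in seen:
                seen.append(name)
    return seen


def project(record, fields):
    """Keep only the given fields; fields the record lacks become None."""
    return {name: record.get(name) for name in fields}


schema_v1 = ["id", "ts"]              # original files
schema_v2 = ["id", "ts", "campaign"]  # one file with an added field

old_record = {"id": 1, "ts": 10}
new_record = {"id": 2, "ts": 20, "campaign": "spring"}

# Option 1: intersection of the fields common to all schemas.
narrow = [project(r, common_fields([schema_v1, schema_v2]))
          for r in (old_record, new_record)]

# Option 2: all fields, with missing ones nulled.
wide = [project(r, all_fields([schema_v1, schema_v2]))
        for r in (old_record, new_record)]
```

The intersection never invents values but silently hides the new field, while
the null-filled variant keeps every field at the cost of nulls in older rows.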

Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com
