pig-dev mailing list archives

From "Cheolsoo Park (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-2579) Support for multiple input schemas in AvroStorage
Date Sat, 01 Sep 2012 00:55:07 GMT

     [ https://issues.apache.org/jira/browse/PIG-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Cheolsoo Park updated PIG-2579:

    Attachment: PIG-2579.patch

I updated Stan's original patch, rebasing it onto trunk. While I kept the core logic unchanged,
I made the following modifications:
# Removed the glob-pattern-related code, as that is resolved in PIG-2492.
# Added an option 'multiple_schema' to AvroStorage. By default, AvroStorage assumes that all
the input files have the same schema, but if 'multiple_schema' is passed to the load function,
it tries to merge all of the input schemas.
# Allows multiple schemas with the same name. I use paths to identify schemas instead of their
# Refactored code.
# Added unit tests.

I think the most debatable part is how to merge two different schemas into one. In short,
the rules are as follows:
# Different primitive types can be merged if certain conditions are met. Please see AvroStorageUtils.mergeType()
for more details.
# Only the same kind of complex types can be merged. e.g. record + record => ok, but record
+ array => error.
# For records, the union of fields is returned.
# For arrays/maps, their element types/value types are merged.
# For unions, the union of unions is returned.
# For fixed types, only fixeds of the same size can be merged.
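
To make the rules above concrete, here is a rough, self-contained sketch in plain Java. It is not the code in the patch and does not use the Avro or Pig APIs: SketchSchema, Kind, merge() and mergePrimitive() are hypothetical names made up for illustration. Three points are assumptions rather than facts from the patch: primitives are merged by numeric widening (int < long < float < double), a record field that appears on both sides is merged recursively, and union branches are deduplicated by kind only. The real conditions for primitive types are in AvroStorageUtils.mergeType().

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-in for an Avro schema; not the real Avro or Pig API. */
final class SketchSchema {
    enum Kind { INT, LONG, FLOAT, DOUBLE, STRING, RECORD, ARRAY, MAP, UNION, FIXED }

    final Kind kind;
    final Map<String, SketchSchema> fields = new LinkedHashMap<>(); // RECORD: name -> field schema
    final List<SketchSchema> branches = new ArrayList<>();          // UNION: member schemas
    SketchSchema element;                                           // ARRAY element or MAP value
    int size;                                                       // FIXED: size in bytes

    SketchSchema(Kind kind) { this.kind = kind; }

    static SketchSchema of(Kind kind) { return new SketchSchema(kind); }

    static SketchSchema container(Kind kind, SketchSchema element) {
        SketchSchema s = new SketchSchema(kind);
        s.element = element;
        return s;
    }

    static SketchSchema fixed(int size) {
        SketchSchema s = new SketchSchema(Kind.FIXED);
        s.size = size;
        return s;
    }

    static SketchSchema union(SketchSchema... members) {
        SketchSchema s = new SketchSchema(Kind.UNION);
        for (SketchSchema m : members) s.branches.add(m);
        return s;
    }

    /** Adds a field to a RECORD schema and returns this for chaining. */
    SketchSchema field(String name, SketchSchema schema) {
        fields.put(name, schema);
        return this;
    }

    boolean isPrimitive() { return kind.ordinal() <= Kind.STRING.ordinal(); }

    /** Merges two schemas following the rules listed above; throws if they are incompatible. */
    static SketchSchema merge(SketchSchema a, SketchSchema b) {
        if (a.isPrimitive() && b.isPrimitive()) return of(mergePrimitive(a.kind, b.kind));
        if (a.kind != b.kind) {
            // Only the same kind of complex type can be merged (record + array => error).
            throw new IllegalArgumentException("cannot merge " + a.kind + " with " + b.kind);
        }
        switch (a.kind) {
            case RECORD: {
                SketchSchema r = of(Kind.RECORD);
                r.fields.putAll(a.fields);
                // Union of fields. Assumption: a field present on both sides is merged recursively.
                b.fields.forEach((name, s) -> r.fields.merge(name, s, SketchSchema::merge));
                return r;
            }
            case ARRAY:
            case MAP:
                // Element types (arrays) or value types (maps) are merged.
                return container(a.kind, merge(a.element, b.element));
            case UNION: {
                SketchSchema u = of(Kind.UNION);
                u.branches.addAll(a.branches);
                // Union of unions. Simplification: branches are deduplicated by kind only.
                for (SketchSchema s : b.branches) {
                    if (u.branches.stream().noneMatch(x -> x.kind == s.kind)) u.branches.add(s);
                }
                return u;
            }
            case FIXED:
                if (a.size != b.size) throw new IllegalArgumentException("fixed sizes differ");
                return fixed(a.size);
            default:
                throw new IllegalStateException("unhandled kind " + a.kind);
        }
    }

    /** Assumed rule: identical types merge to themselves; numeric types widen to the larger one. */
    static Kind mergePrimitive(Kind a, Kind b) {
        if (a == b) return a;
        boolean numeric = a != Kind.STRING && b != Kind.STRING;
        if (!numeric) throw new IllegalArgumentException("cannot merge " + a + " with " + b);
        return a.ordinal() > b.ordinal() ? a : b;
    }
}
```

With this sketch, merging a record that has only field "a" with a record that has only field "b" yields a record with both fields, while merging a record with an array, or two fixeds of different sizes, throws IllegalArgumentException.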

It's easy to see in a unit test (TestAvroStorageUtils) what's expected when two schemas are

Please let me know if you have any questions/concerns.

> Support for multiple input schemas in AvroStorage
> -------------------------------------------------
>                 Key: PIG-2579
>                 URL: https://issues.apache.org/jira/browse/PIG-2579
>             Project: Pig
>          Issue Type: New Feature
>          Components: piggybank
>    Affects Versions: 0.9.2, 0.11
>            Reporter: Stan Rosenberg
>            Assignee: Cheolsoo Park
>            Priority: Minor
>         Attachments: avro_storage_union_schema.patch, avro_storage_union_schema_test.tar.gz,
PIG-2579-avro_test_files.tar.gz, PIG-2579.patch
> This is a barebones patch for AvroStorage which enables support of multiple input schemas.
 The assumption is that the input consists of avro files having different schemas that can
be unioned, e.g., flat records.  
> A simple illustrative example is attached (avro_storage_union_schema_test.tar.gz): run
create_avro1.pig, followed by create_avro2.pig, followed by read_avro.pig.

