hadoop-pig-dev mailing list archives

From "Dmitriy V. Ryaboy (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-760) Serialize schemas for PigStorage() and other storage types.
Date Thu, 22 Oct 2009 22:05:59 GMT

     [ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-760:

    Attachment: pigstorageschema-2.patch

New patch to address findbugs and make the classes a little nicer to use.

Made internal fields protected, since having them public *and* having getters/setters didn't
really make sense.

Setters now return "this", so that they can be chained.

Array setters make a copy of the passed-in array.  Getters return the internal array, so it's
still possible to shoot oneself in the foot (as findbugs points out), but side-effecting those
arrays is the intended use case.
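
For readers without the patch at hand, here is a minimal sketch of the accessor conventions described above: protected fields, chainable setters, a copying array setter, and a getter that exposes the internal array. The class and member names (FlatSchema, fieldNames, delimiter) are hypothetical and are not taken from the patch.

```java
import java.util.Arrays;

// Hypothetical stand-in for a schema class; not the class from the patch.
class FlatSchema {
    // Protected rather than public: outside access goes through getters/setters.
    protected String[] fieldNames = new String[0];
    protected String delimiter = "\t";

    // Copies the caller's array, so later changes to the argument
    // do not leak into this object. Returns "this" to allow chaining.
    FlatSchema setFieldNames(String[] fieldNames) {
        this.fieldNames = Arrays.copyOf(fieldNames, fieldNames.length);
        return this;
    }

    FlatSchema setDelimiter(String delimiter) {
        this.delimiter = delimiter;
        return this;
    }

    // Returns the internal array itself (no copy), so callers can
    // modify the schema in place.
    String[] getFieldNames() {
        return fieldNames;
    }

    String getDelimiter() {
        return delimiter;
    }
}

class FlatSchemaDemo {
    public static void main(String[] args) {
        String[] input = {"a", "b"};
        FlatSchema schema = new FlatSchema().setFieldNames(input).setDelimiter(",");

        input[0] = "changed";             // no effect: the setter copied the array
        schema.getFieldNames()[1] = "c";  // takes effect: the getter exposes the internal array

        System.out.println(Arrays.toString(schema.getFieldNames())); // prints [a, c]
    }
}
```

The asymmetry is deliberate in this sketch: the copy in the setter guards against accidental aliasing by the caller, while the uncopied getter keeps in-place modification possible.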

Still flat schemas only; I haven't gotten around to wrestling with the Jackson parser on this one.
David -- do you need nested schemas?

Submitting as a patch so that Hudson can have a go. Would appreciate code comments, especially
with regard to the interfaces (and changes I made to them) from the Load/Store redesign proposal.

We probably want to hold off on committing this until the new interfaces settle in a bit.

> Serialize schemas for PigStorage() and other storage types.
> -----------------------------------------------------------
>                 Key: PIG-760
>                 URL: https://issues.apache.org/jira/browse/PIG-760
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: David Ciemiewicz
>            Assignee: Dmitriy V. Ryaboy
>             Fix For: 0.6.0
>         Attachments: pigstorageschema-2.patch, pigstorageschema.patch
>
> I'm finding PigStorage() really convenient for storage and data interchange because it
> compresses well and imports into Excel and other analysis environments well.
> However, it is a pain when it comes to maintenance because the columns are in fixed locations
> and I'd like to add columns in some cases.
> It would be great if load PigStorage() could read a default schema from a .schema file
> stored with the data and if store PigStorage() could store a .schema file with the data.
> I have tested this out and both Hadoop HDFS and Pig in -exectype local mode will ignore
> a file called .schema in a directory of part files.
> So, for example, if I have a chain of Pig scripts I execute such as:
> A = load 'data-1' using PigStorage() as ( a: int , b: int );
> store A into 'data-2' using PigStorage();
> B = load 'data-2' using PigStorage();
> describe B;
> describe B should output something like { a: int, b: int }
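
The sidecar-file idea in the description above can be sketched roughly as follows, for flat schemas only. This is an illustration under stated assumptions, not the patch's implementation: the SchemaSidecar class, its methods, and the one-line-per-field "name:type" text format are all hypothetical, and it uses the local filesystem rather than HDFS.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the file format here is an assumption for illustration.
class SchemaSidecar {
    static final String FILE_NAME = ".schema";

    // Writes one "name:type" line per field into <dataDir>/.schema,
    // alongside the part files.
    static void store(Path dataDir, String[] names, String[] types) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            lines.add(names[i] + ":" + types[i]);
        }
        try {
            Files.write(dataDir.resolve(FILE_NAME), lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the sidecar back and renders it in the flat form that
    // the description expects from "describe".
    static String describe(Path dataDir) {
        List<String> fields = new ArrayList<>();
        try {
            for (String line : Files.readAllLines(dataDir.resolve(FILE_NAME), StandardCharsets.UTF_8)) {
                String[] parts = line.split(":", 2);
                fields.add(parts[0] + ": " + parts[1]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return "{ " + String.join(", ", fields) + " }";
    }
}
```

With this sketch, storing the fields a and b with type int into a directory and then calling describe on that directory would return "{ a: int, b: int }", matching the behavior requested for describe B.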

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
