hadoop-pig-dev mailing list archives

From "David Ciemiewicz (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
Date Thu, 09 Apr 2009 20:12:12 GMT

    [ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697631#action_12697631 ]

David Ciemiewicz commented on PIG-760:
--------------------------------------

Sure, you could do that and create PigStorageSchema.

The thing is, I don't think it is necessary, and it is possible to do this in a
backward-compatible way.

First, if the user specifies a LOAD ... AS clause schema, then PigStorage could simply use
that "casting" to override what is in the .schema.  Of course, PigStorage might want to warn
that there is an override at run time or do a "smart" warning only if there are incompatible
differences between the serialized schema and the explicit AS clause schema.
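
To make the override rule concrete, here is a minimal Python sketch (not Pig's actual implementation). The function name resolve_schema, the (name, type) tuple representation, and the definition of "incompatible" (a different column count or a different type at the same position) are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: one (column name, type name) pair per column.
Schema = List[Tuple[str, str]]


def resolve_schema(
    stored: Optional[Schema], declared: Optional[Schema]
) -> Tuple[Optional[Schema], List[str]]:
    """Pick the schema to use and collect "smart" warnings.

    stored   -- schema read from the .schema file, or None if absent
    declared -- schema from the LOAD ... AS clause, or None if absent
    """
    warnings: List[str] = []
    if declared is None:
        # No AS clause: fall back to the serialized schema (may be None).
        return stored, warnings
    if stored is None:
        # Nothing serialized: the AS clause is all we have.
        return declared, warnings

    # Both exist: the AS clause wins, but warn about incompatible differences.
    if len(stored) != len(declared):
        warnings.append(
            f"AS clause has {len(declared)} columns, .schema has {len(stored)}"
        )
    else:
        for i, ((_, stype), (dname, dtype)) in enumerate(zip(stored, declared)):
            if stype != dtype:
                warnings.append(
                    f"column {i} ({dname}): AS clause type {dtype} "
                    f"overrides .schema type {stype}"
                )
    return declared, warnings
```

In this sketch a pure rename (same types, different column names) overrides silently, while a type or column-count mismatch produces a warning.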

Next, is there really any harm in creating the serialized schema file on each and every STORE?

Finally, why subclass when we could parameterize?

In other words, instead of writing:

store A into 'file' using PigStorageSchema();

Why not do:

store A into 'file' using PigStorage('schema=yes');  -- redundant: schema=yes is the default

I think it would be more useful to have single classes with parameterized options than a proliferation
of classes.
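
A single parameterized class mostly needs a small parser for 'key=value' strings. The following Python sketch shows one way that could look; it is not PigStorage's API. The function name parse_options is hypothetical, and the default for erroronmissingcolumn is an assumption (only schema=yes is stated as a default above).

```python
from typing import Dict

# Assumed defaults for illustration only.
DEFAULTS: Dict[str, str] = {
    "sep": "\t",
    "schema": "yes",
    "erroronmissingcolumn": "no",
}


def parse_options(*args: str) -> Dict[str, str]:
    """Merge 'key=value' strings over the defaults.

    Unknown keys and arguments without '=' are rejected so that typos
    do not silently fall back to default behavior.
    """
    options = dict(DEFAULTS)
    for arg in args:
        key, eq, value = arg.partition("=")
        if not eq or key not in DEFAULTS:
            raise ValueError(f"unrecognized option: {arg!r}")
        options[key] = value
    return options
```

Calling parse_options() with no arguments returns the defaults, so existing scripts that use a bare PigStorage() would keep working under this scheme.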

Or, better yet, why can't I just define the behavior of PigStorage() for all of the instances
in my script:

define PigStorage PigStorage(
        'sep=\t',
        'schema=yes',
        'erroronmissingcolumn=no'
);

I have recently done similar things for other functions and it turns out to be a nice way
of capturing "global" parameterizations for cleaner Pig code.




> Serialize schemas for PigStorage() and other storage types.
> -----------------------------------------------------------
>
>                 Key: PIG-760
>                 URL: https://issues.apache.org/jira/browse/PIG-760
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: David Ciemiewicz
>
> I'm finding PigStorage() really convenient for storage and data interchange because it
> compresses well and imports into Excel and other analysis environments well.
> However, it is a pain when it comes to maintenance because the columns are in fixed locations
> and I'd like to add columns in some cases.
> It would be great if load PigStorage() could read a default schema from a .schema file
> stored with the data and if store PigStorage() could store a .schema file with the data.
> I have tested this out and both Hadoop HDFS and Pig in -exectype local mode will ignore
> a file called .schema in a directory of part files.
> So, for example, if I have a chain of Pig scripts I execute such as:
> A = load 'data-1' using PigStorage() as ( a: int , b: int );
> store A into 'data-2' using PigStorage();
> B = load 'data-2' using PigStorage();
> describe B;
> describe B should output something like { a: int, b: int }
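
The store/load round trip requested in the issue can be sketched in Python as follows. This is only an illustration under stated assumptions: the .schema file name comes from the issue, but the JSON layout and the function names store_schema, load_schema, and describe are hypothetical and are not part of Pig.

```python
import json
import os
from typing import List, Optional, Tuple

# Hypothetical representation: one (column name, type name) pair per column.
Schema = List[Tuple[str, str]]

SCHEMA_FILE = ".schema"  # name from the issue; the JSON format is an assumption


def store_schema(directory: str, schema: Schema) -> None:
    """Write the schema next to the part files, as STORE would."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, SCHEMA_FILE)
    with open(path, "w", encoding="utf-8") as f:
        json.dump([{"name": n, "type": t} for n, t in schema], f)


def load_schema(directory: str) -> Optional[Schema]:
    """Read the default schema if present, as LOAD would; None otherwise."""
    path = os.path.join(directory, SCHEMA_FILE)
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return [(c["name"], c["type"]) for c in json.load(f)]


def describe(schema: Optional[Schema]) -> str:
    """Format a schema in the style of the describe output quoted above."""
    if schema is None:
        return "Schema unknown"
    return "{ " + ", ".join(f"{n}: {t}" for n, t in schema) + " }"
```

With this sketch, storing [("a", "int"), ("b", "int")] into a directory and loading it back yields a schema that describe formats as { a: int, b: int }, and a directory without a .schema file loads as before with no schema.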

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

