hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
Date Tue, 27 Oct 2009 17:06:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770573#action_12770573 ]

Alan Gates commented on PIG-760:
--------------------------------

I know I'm wandering dangerously close to being fanatical here, but I really dislike taking
a struct, making all the members private/protected, and then adding getters and setters. 
If some tools need getters and setters, feel free to add them.  But please leave the members
public.
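To show what I mean, here is a quick sketch of the two styles (the class and member names are just placeholders, not the actual fields in the patch):

// Struct style: members stay public; tools that want accessors can add them on top.
public class FieldSchemaStructStyle {
    public String name;
    public byte type;
}

// Bean style: the same data, but every access has to go through a method.
class FieldSchemaBeanStyle {
    private String name;
    private byte type;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public byte getType() { return type; }
    public void setType(byte type) { this.type = type; }
}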

I notice you snuck in your names for LoadMetadata and StoreMetadata.  I'm fine with motions
to change the names.  But let's get everyone to agree on the new names before we start using
them.

On the StoreMetadata interface, Pradeep had some thoughts on getting rid of it, as he felt
all the necessary information could be communicated in StoreFunc.allFinished().  He should
be publishing an update to the load/store redesign wiki ( http://wiki.apache.org/pig/LoadStoreRedesignProposal
) soon.  He also wanted to change LoadMetadata.getSchema() to take a location so that the
loader could find the file.
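For concreteness, my reading of that change is a signature along these lines (a rough sketch, not the final interface; ResourceSchema here is the schema class from the same proposal):

import java.io.IOException;

public interface LoadMetadata {

    // Return the schema of the data at the given location, so the loader can
    // find and read whatever side file (for example a .schema file) sits with the data.
    ResourceSchema getSchema(String location) throws IOException;
}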

Other changes all look good.  

One general thought.  I want to figure out how to keep the ResourceStatistics object flexible
enough that it's easy to add new statistics to it.  One thought I'd had previously (I can't
remember if we discussed this or not) was to add a Map<String, Object> to it.  That
way we can add new stats between versions of the object.  Once the stats are accepted as valid
and take hold, they could be moved into the object proper.  The upside of this is that it's flexible.
 The downside is that we risk devolving into a bag of unknown properties, and that every stat has
to go through a transition.  Thoughts?
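For concreteness, the shape I have in mind is roughly this (a sketch only; the concrete members are placeholders):

import java.util.HashMap;
import java.util.Map;

public class ResourceStatistics {

    // Stats that have taken hold live as proper members.
    public Long numRecords;
    public Long sizeInBytes;

    // New or experimental stats land here, so they can be added between
    // versions of the object without changing its shape.
    public Map<String, Object> otherStats = new HashMap<String, Object>();
}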

> Serialize schemas for PigStorage() and other storage types.
> -----------------------------------------------------------
>
>                 Key: PIG-760
>                 URL: https://issues.apache.org/jira/browse/PIG-760
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: David Ciemiewicz
>            Assignee: Dmitriy V. Ryaboy
>             Fix For: 0.6.0
>
>         Attachments: pigstorageschema-2.patch, pigstorageschema.patch
>
>
> I'm finding PigStorage() really convenient for storage and data interchange because it compresses well and imports into Excel and other analysis environments well.
> However, it is a pain when it comes to maintenance because the columns are in fixed locations and I'd like to add columns in some cases.
> It would be great if load PigStorage() could read a default schema from a .schema file stored with the data and if store PigStorage() could store a .schema file with the data.
> I have tested this out and both Hadoop HDFS and Pig in -exectype local mode will ignore a file called .schema in a directory of part files.
> So, for example, if I have a chain of Pig scripts I execute such as:
> A = load 'data-1' using PigStorage() as ( a: int , b: int );
> store A into 'data-2' using PigStorage();
> B = load 'data-2' using PigStorage();
> describe B;
> describe B should output something like { a: int, b: int }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

