hadoop-pig-dev mailing list archives

From "David Ciemiewicz (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
Date Thu, 09 Apr 2009 20:18:13 GMT

    [ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697634#action_12697634 ]

David Ciemiewicz commented on PIG-760:
--------------------------------------

One other thing: along with the schema file, it would be useful to have a header file
written out as well.

While the serialized schema file might contain:

{ a: int, b: int }

the serialized header file would contain:

a<tab>b<newline>

based on whatever the separator value for PigStorage() happens to be.
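
To illustrate the mapping from a schema line to a header line, here is a minimal shell
sketch. It is only an assumption about how such a conversion could look, not something
Pig provides: the variable names are hypothetical, and it handles only flat schemas of
the form shown above (no nested tuples, bags, or maps).

```shell
# Hypothetical flat schema, copied from the example above.
schema='{ a: int, b: int }'

# Strip braces and spaces, split on commas, keep the field names,
# then join them with a tab (assumed here to be the PigStorage() separator).
header=$(printf '%s\n' "$schema" \
  | tr -d '{} ' \
  | tr ',' '\n' \
  | cut -d: -f1 \
  | paste -s -d '\t' -)

printf '%s\n' "$header"
```

A different separator would only change the final paste step.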

This way, when people need to take results and import them into Excel, R, JMP, or other systems
for analysis, they could just do:

hadoop fs -cat results/.header results/* > results.txt

and results.txt would contain:

a       b
1       2
2       3

greatly simplifying the process without having to create a schema-to-headers conversion program.
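
The concatenation step can be tried without a cluster. The sketch below simulates the
proposed layout on the local filesystem with plain cat; the directory, the part file
names, and the workdir variable are assumptions made up for the illustration, and it
assumes that hadoop fs -cat would treat the .header file and the results/* glob in an
analogous way.

```shell
# Local stand-in for a results directory holding a .header file and part files.
workdir=$(mktemp -d)
mkdir "$workdir/results"
printf 'a\tb\n' > "$workdir/results/.header"
printf '1\t2\n' > "$workdir/results/part-00000"
printf '2\t3\n' > "$workdir/results/part-00001"

# The shell glob results/* skips dotfiles, so the header appears only once,
# at the top, followed by the part files in name order.
cat "$workdir/results/.header" "$workdir/results"/* > "$workdir/results.txt"

cat "$workdir/results.txt"
```

The resulting results.txt starts with the header line, followed by the data rows.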

> Serialize schemas for PigStorage() and other storage types.
> -----------------------------------------------------------
>
>                 Key: PIG-760
>                 URL: https://issues.apache.org/jira/browse/PIG-760
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: David Ciemiewicz
>
> I'm finding PigStorage() really convenient for storage and data interchange because it
> compresses well and imports into Excel and other analysis environments well.
> However, it is a pain when it comes to maintenance because the columns are in fixed locations
> and I'd like to add columns in some cases.
> It would be great if load PigStorage() could read a default schema from a .schema file
> stored with the data and if store PigStorage() could store a .schema file with the data.
> I have tested this out and both Hadoop HDFS and Pig in -exectype local mode will ignore
> a file called .schema in a directory of part files.
> So, for example, if I have a chain of Pig scripts I execute such as:
> A = load 'data-1' using PigStorage() as ( a: int , b: int );
> store A into 'data-2' using PigStorage();
> B = load 'data-2' using PigStorage();
> describe B;
> describe B should output something like { a: int, b: int }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

