hadoop-pig-dev mailing list archives

From "Jing Huang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-833) Storage access layer
Date Wed, 19 Aug 2009 17:25:15 GMT

    [ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745140#action_12745140 ]

Jing Huang commented on PIG-833:

Here is an example for different data types:
final static String STR_SCHEMA = "s1:bool, s2:int, s3:long, s4:float, s5:string, s6:bytes, "
    + "r1:record(f1:int, f2:long), r2:record(r3:record(f3:float, f4)), m1:map(string),m2:map(map(int)), "
    + "c:collection(f13:double, f14:float, f15:bytes)";

final static String STR_STORAGE = "[s1, s2]; [m1#{a}]; [r1.f1]; [s3, s4, r2.r3.f3]; [s5, s6, "
    + "m2#{x|y}]; [r1.f2, m1#{b}]; [r2.r3.f4, m2#{z}]";

On the schema side, s1 through s6 are simple data types. m2:map(map(int)) means that m2 is a
map of maps: each value of m2 is itself a map, and the values of that inner map are of type int.
(Map keys are always strings.) r2:record(r3:record(f3:float, f4)) means that r2 is a record with
a single field, r3, which is itself a record. r3 has two fields: f3, a float, and f4, which takes
the default type, bytes.
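
To make the nesting easier to see, here is a minimal sketch that lists the top-level columns
of the schema string by splitting only on commas that sit outside parentheses. The class name
SchemaSketch and the method topLevelColumns are hypothetical names chosen for illustration;
this is not the Zebra schema parser, and it assumes the parentheses in the string are balanced.

```java
import java.util.ArrayList;
import java.util.List;

public class SchemaSketch {
    // Same schema string as in the example above.
    public static final String STR_SCHEMA =
        "s1:bool, s2:int, s3:long, s4:float, s5:string, s6:bytes, "
        + "r1:record(f1:int, f2:long), r2:record(r3:record(f3:float, f4)), m1:map(string),m2:map(map(int)), "
        + "c:collection(f13:double, f14:float, f15:bytes)";

    // Hypothetical helper: split on commas that are not nested inside parentheses.
    public static List<String> topLevelColumns(String schema) {
        List<String> columns = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // current parenthesis nesting level
        for (char ch : schema.toCharArray()) {
            if (ch == '(') depth++;
            if (ch == ')') depth--;
            if (ch == ',' && depth == 0) {
                // A comma at depth 0 ends one top-level column definition.
                columns.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        String last = current.toString().trim();
        if (!last.isEmpty()) {
            columns.add(last);
        }
        return columns;
    }

    public static void main(String[] args) {
        // Prints one top-level column definition per line.
        for (String column : topLevelColumns(STR_SCHEMA)) {
            System.out.println(column);
        }
    }
}
```

Nested definitions such as r2:record(r3:record(f3:float, f4)) stay intact as a single column
because their inner commas are at a nesting depth greater than zero.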

On the storage side, each bracketed list is a column group. For example, [m1#{a}] means that
the entry of map m1 with key 'a' is in this column group. [s5, s6, m2#{x|y}] means that s5, s6,
and the entries of map m2 with key 'x' or 'y' are in this column group. [r2.r3.f4, m2#{z}] means
that field f4 of record r3 (nested in record r2) and the entry of map m2 with key 'z' are in
this column group.
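
As a rough illustration of how the storage string breaks down, the sketch below splits it into
column groups and pulls the keys out of a map entry such as m2#{x|y}. The class StorageSketch
and the methods columnGroups and mapKeys are hypothetical names for this example only; this is
not how Zebra itself parses the string, and it assumes no commas or semicolons appear inside
the braces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StorageSketch {
    // Same storage string as in the example above.
    public static final String STR_STORAGE =
        "[s1, s2]; [m1#{a}]; [r1.f1]; [s3, s4, r2.r3.f3]; [s5, s6, "
        + "m2#{x|y}]; [r1.f2, m1#{b}]; [r2.r3.f4, m2#{z}]";

    // Hypothetical helper: one inner list per "[...]" column group.
    public static List<List<String>> columnGroups(String storage) {
        List<List<String>> groups = new ArrayList<>();
        for (String part : storage.split(";")) {
            String body = part.trim();
            if (body.isEmpty()) {
                continue;
            }
            if (!body.startsWith("[") || !body.endsWith("]")) {
                throw new IllegalArgumentException("Not a column group: " + body);
            }
            // Drop the surrounding brackets, then split the member list.
            body = body.substring(1, body.length() - 1);
            List<String> members = new ArrayList<>();
            for (String member : body.split(",")) {
                members.add(member.trim());
            }
            groups.add(members);
        }
        return groups;
    }

    // Hypothetical helper: "m2#{x|y}" -> [x, y]; empty list if no keys are given.
    public static List<String> mapKeys(String member) {
        int start = member.indexOf("#{");
        if (start < 0 || !member.endsWith("}")) {
            return new ArrayList<>();
        }
        String keys = member.substring(start + 2, member.length() - 1);
        return Arrays.asList(keys.split("\\|"));
    }

    public static void main(String[] args) {
        List<List<String>> groups = columnGroups(STR_STORAGE);
        for (int i = 0; i < groups.size(); i++) {
            System.out.println("column group " + i + ": " + groups.get(i));
        }
    }
}
```

Note that the same map can contribute to several column groups: in this example m1 appears
with key 'a' in one group and key 'b' in another, and m2 is split across the groups holding
keys 'x' or 'y' and key 'z'.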


> Storage access layer
> --------------------
>                 Key: PIG-833
>                 URL: https://issues.apache.org/jira/browse/PIG-833
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jay Tang
>         Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2,
> PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
> A layer is needed to provide a high level data access abstraction and a tabular view
> of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval
> code.  This layer should also include a columnar storage format in order to provide fast data
> projection, CPU/space-efficient data serialization, and a schema language to manage physical
> storage metadata.  Eventually it could also support predicate pushdown for further performance
> improvement.  Initially, this layer could be a contrib project in Pig and become a hadoop
> subproject later on.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
