hadoop-pig-dev mailing list archives

From "Samuel Guo (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-6) Addition of Hbase Storage Option In Load/Store Statement
Date Tue, 18 Nov 2008 12:11:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648582#action_12648582 ]

Samuel Guo commented on PIG-6:

My ideas about this issue.

** Load from / Store into Table **

* Target *
Give Pig the ability to load from / store into tables in bigtable-like systems
(such as HBase, Hypertable, and maybe Cassandra in the future).

* Grammar *

<tableloadclause> := <LOAD> "TABLE" <tablepath> "PROJECTION" <projections_list>
AS <schema>
<tablestoreclause> := <STORE> <IDENTIFIER> "PROJECTION" <INTO> "TABLE"
<tablepath> <projections_list>
<projections_list> := <projection> ["," <projections_list>]
<projection> := "'" <string>:<string>:<string> "'"
<tablepath> := "'" <string>:<string> "'"

<tablepath> is formed of two parts, "schema" and "tablename". The "schema" identifies the
system the table lives in; it may be "hbase", "hypertable", or another system.
<projection> is formed of three parts: "column_family_name", "column_name", and "timestamp".
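To make the <projection> format concrete, here is a minimal sketch of how such a string could be split into its three parts. The `ProjectionParser` class and its defaulting rules (an empty column meaning "all columns in the family", an empty timestamp meaning "latest version") are my assumptions for illustration, not part of the proposal or of Pig's code:

```java
// Hypothetical helper: splits a <projection> string "family:column:timestamp"
// into its three parts. An empty column part means "all columns in the family";
// an empty timestamp part means "latest version". Names are assumptions.
public class ProjectionParser {
    public static String[] parseProjection(String projection) {
        // limit = -1 keeps trailing empty strings, so "family2::" -> ["family2", "", ""]
        String[] parts = projection.split(":", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "expected 'family:column:timestamp', got: " + projection);
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] p = parseProjection("family1:column1:timestamp1");
        System.out.println(p[0] + " / " + p[1] + " / " + p[2]);
    }
}
```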

* Examples *

An example is below:

-- load the table 'table1' from 'hbase'
-- the operation projects the content of "family1:column1" at timestamp1 to field1
-- the operation projects all contents of "family2:" at timestamp2 to field2
-- the operation projects the latest content of "family3:" to field3
A = Load table 'hbase:table1' projection 'family1:column1:timestamp1', 'family2::timestamp2',
'family3::' as (field1: chararray, field2: tuple, field3:tuple);

-- do some operation over A
B = ...A;

-- store B into 'hbase' as table 'table2'
-- projects B.$1 to 'family1:column1' with the system's current timestamp
-- projects B.$2 to 'family2:column2' with timestamp v2
Store B projection into table 'hbase:table2' 'family1:column1:', 'family2:column2:v2';

* Data I/O over Table *

First, we need a custom DataStorage to do the table data I/O, something like:

public interface TableDataStorage extends DataStorage {
    // ...
}

The *TableDataStorage* will abstract over all the bigtable-like systems.


For HBase, we can construct the HBase DataStorage like:

public class HbaseDataStorage implements TableDataStorage {
    // ...
}

For Hypertable, we may have a different DataStorage, like:

public class HypertableDataStorage implements TableDataStorage {
    // ...
}
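As a self-contained sketch of the proposed abstraction (these are not existing Pig classes; the `getScheme` method and the `forScheme` lookup are my assumptions), the "schema" part of a <tablepath> like 'hbase:table1' could be used to select the right implementation:

```java
// Sketch of the proposed abstraction; names and methods are assumptions,
// not existing Pig classes. The "schema" part of <tablepath> selects which
// implementation to use.
interface TableDataStorage {
    String getScheme();                    // e.g. "hbase" or "hypertable"
    boolean tableExists(String tableName); // probe the backing system
}

class HbaseDataStorage implements TableDataStorage {
    public String getScheme() { return "hbase"; }
    public boolean tableExists(String tableName) {
        // real code would ask the HBase master; stubbed here
        return false;
    }
}

public class TableStorageDemo {
    // pick an implementation from the schema part of a tablepath like "hbase:table1"
    public static TableDataStorage forScheme(String scheme) {
        if (scheme.equals("hbase")) {
            return new HbaseDataStorage();
        }
        throw new IllegalArgumentException("unknown schema: " + scheme);
    }

    public static void main(String[] args) {
        System.out.println(forScheme("hbase").getScheme());
    }
}
```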

* MapReduce Stuff *

Because a table is different from a file, we may need a different slice interface, something like:

public interface TableSlice extends Serializable {
    // get slice locations
    String[] getLocations();

    // init the data storage
    void init(TableDataStorage store) throws IOException;

    // get the table's name
    byte[] getTableName();

    // get the start row of this table slice
    byte[] getStartRow();

    // get the end row of this table slice
    byte[] getEndRow();

    // get the current row of this table slice
    byte[] getCurRow();

    // get the progress
    float getProgress() throws IOException;

    // get the next tuple
    boolean next(Tuple tuple) throws IOException;
}

And we need a related table slicer:

public interface TableSlicer {
    void validate(TableDataStorage store, String location) throws IOException;
    TableSlice[] slice(TableDataStorage store, String location) throws IOException;
}

Finally, we need the InputFormat, OutputFormat, and RecordReader for map/reduce over tables.
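To show how a record reader might drive a TableSlice, here is a minimal runnable sketch. The `SimpleTableSlice` interface, the in-memory slice, and the use of List<String> as a stand-in for Pig's Tuple are all assumptions for illustration, not real Pig or HBase classes:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch of how a record reader could drive a TableSlice: call next() until it
// returns false, reporting progress along the way. SimpleTableSlice and
// InMemoryTableSlice are stand-ins for the proposed interfaces; List<String>
// stands in for Pig's Tuple.
interface SimpleTableSlice {
    boolean next(List<String> tuple) throws IOException;
    float getProgress() throws IOException;
}

class InMemoryTableSlice implements SimpleTableSlice {
    private final Iterator<String> rows;
    private final int total;
    private int read = 0;

    InMemoryTableSlice(List<String> rows) {
        this.rows = rows.iterator();
        this.total = rows.size();
    }

    public boolean next(List<String> tuple) {
        if (!rows.hasNext()) return false;
        tuple.clear();          // reuse the tuple, as record readers typically do
        tuple.add(rows.next());
        read++;
        return true;
    }

    public float getProgress() {
        return total == 0 ? 1.0f : (float) read / total;
    }
}

public class SliceReaderDemo {
    // drain a slice the way a record reader would, counting tuples
    public static int countTuples(SimpleTableSlice slice) throws IOException {
        List<String> tuple = new ArrayList<>();
        int n = 0;
        while (slice.next(tuple)) n++;
        return n;
    }

    public static void main(String[] args) throws IOException {
        SimpleTableSlice slice =
            new InMemoryTableSlice(Arrays.asList("row1", "row2", "row3"));
        System.out.println(countTuples(slice));
    }
}
```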

* Pig Translation *
Currently, Pig's translation can be divided into 3 steps:
First: parser -> logical plan;
Second: logical plan -> physical plan;
Last: physical plan -> map/reduce plan;

In the first two steps, we just need to add operators similar to those for file load/store:
LOLoad -> LOTableLoad
POLoad -> POTableLoad

LOStore -> LOTableStore
POStore -> POTableStore

The difference is in the last step.
When we are constructing a map/reduce job with a table load/store operation, we should use
the table's map/reduce related stuff (such as the InputFormat, OutputFormat, and so on) to
construct the job. The load/store between jobs keeps using temp files.

So a Pig script using table load/store may look like:

source-table --> Job1(using table inputformat) -----> tempfiles(piginputformat/pigoutputformat)
-----> job2 -----> .... -----> jobN ------> target-table(using table outputformat)
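The dispatch described above can be sketched as follows. This is only an illustration of the idea of choosing the format per load kind; the class names are placeholders, not Pig's or HBase's real InputFormat classes:

```java
// Sketch of the dispatch described above: when compiling a map/reduce job,
// a table load gets the table InputFormat, while loads between jobs keep
// Pig's file-based InputFormat. The names returned are placeholders.
public class InputFormatChooser {
    public enum LoadKind { FILE, TABLE }

    public static String inputFormatFor(LoadKind kind) {
        switch (kind) {
            case TABLE:
                return "TableInputFormat"; // reads slices of a source table
            default:
                return "PigInputFormat";   // reads temp files between jobs
        }
    }

    public static void main(String[] args) {
        System.out.println(inputFormatFor(LoadKind.TABLE));
    }
}
```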

* Other Problems *
There may be other optimization problems when using tables for data processing. These problems
are not considered in this solution, to keep it clear.

Comments are welcome :-)

> Addition of Hbase Storage Option In Load/Store Statement
> --------------------------------------------------------
>                 Key: PIG-6
>                 URL: https://issues.apache.org/jira/browse/PIG-6
>             Project: Pig
>          Issue Type: New Feature
>         Environment: all environments
>            Reporter: Edward J. Yoon
> It needs to be able to load full table in hbase.  (maybe ... difficult? i'm not sure
> Also, as described below, 
> It needs to compose an abstract 2d-table only with certain data filtered from hbase array
> structure using arbitrary query-delimited.
> {code}
> A = LOAD table('hbase_table');
> or
> B = LOAD table('hbase_table') Using HbaseQuery('Query-delimited by attributes & timestamp')
> as (f1, f2[, f3]);
> {code}
> Once test is done on my local machines, 
> I will clarify the grammars and give you more examples to help you explain more storage
> Any advice welcome.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
