pig-dev mailing list archives

From Corbin Hoenes <cor...@tynt.com>
Subject Re: [jira] Commented: (PIG-1782) Add ability to load data by column family in HBaseStorage
Date Fri, 28 Jan 2011 02:04:17 GMT
What about Option A, but returning a map?
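
To make that concrete, here is a rough sketch in plain Python. It is not Pig or HBaseStorage code, and the sample rows, values, and function names are all made up for illustration. It compares the tuple-of-two-tuples shape from Option A with a map keyed by column name for the same sparse rows.

```python
# Hypothetical sparse rows: row key -> {column name: most recent value}.
# The column names follow the cpu: family from the issue; the values are invented.
rows = {
    "host1": {"cpu:sys.0": "0.10", "cpu:user.0": "0.55"},
    "host2": {"cpu:sys.0": "0.20", "cpu:sys.1": "0.30", "cpu:user.1": "0.40"},
}


def load_as_tuples(row_key, cells):
    """Option A as proposed: (rowKey, ((column, value), ...))."""
    return (row_key, tuple(sorted(cells.items())))


def load_as_map(row_key, cells):
    """Option A with a map: (rowKey, {column: value})."""
    return (row_key, dict(cells))


def lookup_in_tuples(pairs, column):
    """Linear scan needed to find one column in the tuple-of-tuples shape."""
    for name, value in pairs:
        if name == column:
            return value
    return None


for key, cells in rows.items():
    _, pairs = load_as_tuples(key, cells)
    _, by_name = load_as_map(key, cells)
    # Both shapes carry the same data; the map allows lookup by column name.
    assert lookup_in_tuples(pairs, "cpu:sys.1") == by_name.get("cpu:sys.1")
    print(key, by_name.get("cpu:sys.1"))
```

With the map, a column that is missing from a sparse row simply has no key, and each value can be fetched by column name without scanning variable-length tuples.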

Sent from my iPhone

On Jan 27, 2011, at 5:01 PM, "Bill Graham (JIRA)" <jira@apache.org> wrote:

> [ https://issues.apache.org/jira/browse/PIG-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987839#action_12987839 ]
> Bill Graham commented on PIG-1782:
> ----------------------------------
> Assigning this to myself, since I've got a working patch, but the design of this
> approach needs to be vetted further.
> One issue is that the number of columns per family per row is not constant, so with a
> sparse table you'd have no idea which column name goes with each value of the returned
> tuple. Another issue is that in HBase the column name is often dynamic, descriptive data,
> and there can be multiple timestamped values for a cell.
> * Option A:
> Instead of returning a tuple of values, the load can return a tuple of tuples. Each inner
> tuple is a two-tuple that contains the column descriptor and the most recent value. This
> data structure would be returned if a 'cf:' style column exists in the column list, while
> the default behavior is kept for explicit column names. This is the simplest approach.
> * Option B:
> Build out a richer (and more complex) data structure that also takes into account
> multiple values and their timestamps: a tuple of tuples of tuples of tuples that captures
> the entire HBase KeyValue data structure. Something like this:
> {code}
> (
> ( column name, ( (value, ts), ... ) ), ...
> )
> {code}
> Either way, the variable-length tuples returned for each row, containing additional
> variable-length tuples, would probably require a number of custom UDFs to do anything
> useful with variable-name columns and multiple timestamped values.
> I guess I lean towards Option B so we can support more use cases down the road with this
> refactor. Other opinions?
>> Add ability to load data by column family in HBaseStorage
>> ---------------------------------------------------------
>>                Key: PIG-1782
>>                URL: https://issues.apache.org/jira/browse/PIG-1782
>>            Project: Pig
>>         Issue Type: New Feature
>>        Environment: Java 6, Mac OS X 10.6
>>           Reporter: Eric Yang
>>           Assignee: Bill Graham
>> It would be nice to load all columns in a column family by using a shorthand syntax:
>> {noformat}
>> CpuMetrics = load 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey');
>> {noformat}
>> Assuming there are columns cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1 in the cpu
>> column family,
>> CpuMetrics would contain something like:
>> {noformat}
>> (rowKey, cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1)
>> {noformat}
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
