hadoop-pig-dev mailing list archives

From Mridul Muralidharan <mrid...@yahoo-inc.com>
Subject Re: [jira] Commented: (PIG-210) Column store
Date Tue, 22 Apr 2008 15:45:37 GMT

Support for HBase in Pig would solve this, IMO, without users needing
to write UDFs?

- Mridul

Pi Song (JIRA) wrote:
>     [ https://issues.apache.org/jira/browse/PIG-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590710#action_12590710 ]
> Pi Song commented on PIG-210:
> -----------------------------
> I had started looking at this a bit before I switched to something else,
> so I want to share what I think with you.
> As I said in the other post, Pig is not a DBMS, so it doesn't handle
> changes in data well. For data mining purposes it is still useful, because
> in some use cases you just want to take a snapshot of the data and then
> explore it along different dimensions. Having a column-based file store
> will really help reduce the amount of data read in the process.
> One way to go is to implement this somewhere around
> LOAD/STORE/PigInput/PigOutput. Your data will usually start out in some
> form that is not column based, so the first step is to take the source
> data files and use Pig to convert them to column-based files (using a
> special PigOutput). Later, we can selectively read only the columns we
> need through a special PigInput and feed them to a data mining operator
> (at the very least we will need CUBE in the future; no such operator
> exists at the moment). LOLoad has to be changed a bit to allow you to
> select only the columns you need.
> I got stuck before because of the way Hadoop generates output filenames.
> At the time I didn't spend much time exploring, but I believe there will
> be a way out. If anyone is interested, please join the discussion. I
> should have more free time next month and will get back to this again.
>> Column store
>> ------------
>>                 Key: PIG-210
>>                 URL: https://issues.apache.org/jira/browse/PIG-210
>>             Project: Pig
>>          Issue Type: New Feature
>>          Components: data
>>            Reporter: John DeTreville
>> I believe that Pig stores its tables in row order, which is less
>> efficient in space and time than column order in a data-mining system.
>> Column stores can be more highly compressed, and can be read and written
>> faster. It should be possible for clients to store their tables in
>> column order.
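
The two steps Pi Song describes (convert row-ordered data into column-based files, then read back only the columns needed) can be sketched roughly as below. This is plain Python, not Pig code: `store_columns` and `load_columns` are hypothetical names standing in for the "special PigOutput" and "special PigInput", and the one-file-per-column layout with a `.col` suffix is an assumption made only for illustration.

```python
import os
import tempfile


def store_columns(rows, column_names, out_dir):
    """Write row-ordered records as one file per column (hypothetical layout).

    Assumes every row has one value per column name and that no value
    contains a newline, since each value is stored on its own line.
    """
    os.makedirs(out_dir, exist_ok=True)
    for index, name in enumerate(column_names):
        path = os.path.join(out_dir, name + ".col")
        with open(path, "w", encoding="utf-8") as handle:
            for row in rows:
                handle.write(str(row[index]) + "\n")


def load_columns(in_dir, wanted):
    """Read only the requested columns and zip them back into tuples.

    Files for columns that are not requested are never opened.
    Values come back as strings because no type information is stored.
    """
    columns = []
    for name in wanted:
        path = os.path.join(in_dir, name + ".col")
        with open(path, encoding="utf-8") as handle:
            columns.append([line.rstrip("\n") for line in handle])
    return list(zip(*columns))


if __name__ == "__main__":
    rows = [("alice", 31, "US"), ("bob", 27, "UK"), ("carol", 45, "IN")]
    with tempfile.TemporaryDirectory() as out_dir:
        store_columns(rows, ["name", "age", "country"], out_dir)
        # Only name.col and country.col are read; age.col is skipped.
        print(load_columns(out_dir, ["name", "country"]))
```

Because each column lives in its own file, a job that needs two of three columns reads roughly two thirds of the data, which is the saving the comment above is after. A real implementation on Hadoop would also have to deal with the output filename issue mentioned above, which this sketch ignores.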
