hadoop-pig-dev mailing list archives

From "Pi Song (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-210) Column store
Date Sat, 19 Apr 2008 15:52:21 GMT

    [ https://issues.apache.org/jira/browse/PIG-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590710#action_12590710 ]

Pi Song commented on PIG-210:

I had started looking at this a bit before I switched to something else, so I want to share
my thoughts with you.

As I said in the other post, Pig is not a DBMS, so it doesn't handle changes in data
well. For data mining purposes, it is still useful: in some use cases you just want to take
a snapshot of the data and then explore it along different dimensions. Having a column-based
file store will really help reduce the amount of data read in the process.

One way to go is to implement this somewhere around LOAD/STORE/PigInput/PigOutput. Your data
will primarily be in some form that is not column based, so the first step is to take
the source data files and use Pig to convert them to column-based files (using a special PigOutput).
Later, we can selectively read only the columns we need through a special PigInput into a data
mining operator (well, at least in the future we must have CUBE; none of them exist at the
moment). LOLoad has to be changed a bit to allow you to select only the columns you need.
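
The two-step idea above (convert row data to per-column files once, then read back only
the columns a query needs) can be sketched outside Pig. This is a minimal illustration in
plain Python, not Pig or Hadoop code: the function names, the ".col" file suffix, and the
one-value-per-line layout are all assumptions made for the example, and values are assumed
not to contain newlines.

```python
import os
import tempfile


def write_column_files(rows, column_names, out_dir):
    """Split row-oriented records into one file per column (hypothetical layout)."""
    os.makedirs(out_dir, exist_ok=True)
    handles = {
        name: open(os.path.join(out_dir, name + ".col"), "w", newline="")
        for name in column_names
    }
    try:
        for row in rows:
            # Each value goes to its own column file, one value per line,
            # so line i of every file belongs to row i.
            for name, value in zip(column_names, row):
                handles[name].write(str(value) + "\n")
    finally:
        for handle in handles.values():
            handle.close()


def read_columns(out_dir, wanted):
    """Open only the requested column files and zip them back into tuples."""
    columns = []
    for name in wanted:
        with open(os.path.join(out_dir, name + ".col")) as f:
            columns.append([line.rstrip("\n") for line in f])
    return list(zip(*columns))


# Step 1: convert a row-based snapshot into column files.
rows = [("alice", 34, "NZ"), ("bob", 27, "AU"), ("carol", 41, "NZ")]
out_dir = tempfile.mkdtemp()
write_column_files(rows, ["name", "age", "country"], out_dir)

# Step 2: a later job touches only the two columns it needs;
# the "age" file is never opened.
projected = read_columns(out_dir, ["name", "country"])
```

In this sketch the saving comes from never opening the files of unused columns, which is
roughly what a column-aware loader would have to arrange once LOLoad can pass down the
list of selected columns.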

I got stuck before because of the way Hadoop generates output filenames. At the time, I didn't
spend much time exploring, but I believe there will be a way out. If anyone is interested,
please start a discussion. I should have more free time next month and will get back to this then.

> Column store
> ------------
>                 Key: PIG-210
>                 URL: https://issues.apache.org/jira/browse/PIG-210
>             Project: Pig
>          Issue Type: New Feature
>          Components: data
>            Reporter: John DeTreville
> I believe that Pig stores its tables in row order, which is less efficient in space and
> time than column order in a data-mining system. Column stores can be more highly compressed,
> and can be read and written faster. It should be possible for clients to store their tables
> in column order.
