hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
Date Tue, 22 Sep 2009 00:56:16 GMT

    [ https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758089#action_12758089 ]

Alan Gates commented on PIG-966:
--------------------------------



{quote}
I must not be clear on what pushing down to a loader does. My interpretation was that it allows
pushing down operations to the point where you don't read unnecessary data off disk. A classic
example of filter projection would be filtering by a partition key (so, dt > sysdate-30,
and our data is stored in files one per day). An example of projection pushdown is when
we have a column store that simply avoids loading some of the columns.

I don't see how a loader can push down a join. That seems to require reading and changing
data. Is the idea that such a join can be performed without an MR step? That seems like a
Pig thing, not a loader thing.

In any case, yes, I think something like this would require a new interface in the same namespace,
since it's a drastically different capability.

Any thoughts on advisability of simplifying projection pushdown to just work on an int array?
I know it's limiting, but it's going to be a heck of a lot easier for users to implement.
{quote}

Limiting the data you need to read off disk is partition pruning or, in the case of columnar
stores, column pruning.  But this isn't the only case in which you might want to push down
operators.  Consider data that has (name, age, address) and is partitioned on name.  A user
might want to query only over adults (age > 17).  This isn't a partition field.  But if it's
a columnar store and age is compressed with, say, run-length or offset encoding, the load
function may be able to apply the filter on the compressed data.  This can be a huge win, as
we avoid decompressing whole rows that we don't need.

To see another case where we might want to push operators to the loader, consider a user who
is loading a set of Zebra files, all of which are sorted on one key.  Pig may want to keep
those Zebra files sorted.  It will need a way to tell the loader to merge those files as it
loads them rather than concatenate them and force Pig to re-sort the input.
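
As a rough illustration of the run-length case above, here is a minimal Java sketch of
evaluating a filter such as age > 17 directly on a run-length-encoded column.  Every name in
it (RunLengthFilter, Run, RowRange, matchingRows) is made up for this example; none of it is
Pig or Zebra code, and a real columnar format would have its own encoding and row addressing.
The point is only that the predicate is tested once per run rather than once per row, and the
loader ends up with the row ranges it actually needs to decompress.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical sketch; these types are assumptions, not part of Pig or Zebra. */
class RunLengthFilter {

    /** One run of a run-length-encoded column: value repeated length times. */
    static final class Run {
        final int value;
        final int length;
        Run(int value, int length) { this.value = value; this.length = length; }
    }

    /** Half-open range [start, end) of row indexes that passed the filter. */
    static final class RowRange {
        final long start;
        final long end;
        RowRange(long start, long end) { this.start = start; this.end = end; }
    }

    /**
     * Tests the predicate once per run instead of once per row and returns
     * the row ranges that match, merging ranges that touch each other.
     */
    static List<RowRange> matchingRows(List<Run> column, IntPredicate filter) {
        List<RowRange> out = new ArrayList<>();
        long row = 0;
        for (Run run : column) {
            long next = row + run.length;
            if (filter.test(run.value)) {
                int last = out.size() - 1;
                if (last >= 0 && out.get(last).end == row) {
                    // Contiguous with the previous matching run: extend it.
                    out.set(last, new RowRange(out.get(last).start, next));
                } else {
                    out.add(new RowRange(row, next));
                }
            }
            row = next;
        }
        return out;
    }

    public static void main(String[] args) {
        // The age column for 12 rows, stored as 5 runs.
        List<Run> ages = Arrays.asList(
            new Run(12, 3), new Run(30, 2), new Run(45, 4), new Run(17, 1), new Run(18, 2));
        for (RowRange r : matchingRows(ages, age -> age > 17)) {
            System.out.println("rows [" + r.start + ", " + r.end + ")");
        }
    }
}
```

Only the rows inside the returned ranges would then need their other columns decompressed.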

I understand your concern about making it difficult to pass down just projection, and you are
not the only one to express it.  Even for projection alone, though, we need more than a simple
int array for full projections, so that we can handle things like map, bag, etc. projections.
But maybe we need a simpler option for users who just want to push projection, and then the
full-blown option for power users who want to push selection, etc.  Beginner and advanced
interfaces, I guess.
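
To make the beginner/advanced idea concrete, here is one possible shape for it, sketched in
Java.  All of the names (ProjectionPushdownSketch, RequiredField, SimpleProjection,
AdvancedProjection, adapt) are invented for this example and are not existing or proposed Pig
interfaces.  The simple interface takes only top-level column indexes; the advanced one takes a
tree of required fields so that nested map or bag projections can be described; and an adapter
lets the caller always speak the advanced form.

```java
import java.util.List;

/** Hypothetical sketch; these are not Pig's actual interfaces. */
class ProjectionPushdownSketch {

    /** A required field; subFields is empty when the whole field is needed. */
    static final class RequiredField {
        final int index;
        final List<RequiredField> subFields;
        RequiredField(int index, List<RequiredField> subFields) {
            this.index = index;
            this.subFields = subFields;
        }
    }

    /** "Advanced" option: can describe projections into maps, bags, etc. */
    interface AdvancedProjection {
        void pushProjection(List<RequiredField> fields);
    }

    /** "Beginner" option: top-level column indexes only. */
    interface SimpleProjection {
        void pushProjection(int[] columns);
    }

    /**
     * Wraps a simple loader so it can be driven through the advanced
     * interface. Nested sub-field requests are dropped: the loader returns
     * the whole column and the caller must project inside it afterwards.
     */
    static AdvancedProjection adapt(final SimpleProjection simple) {
        return fields -> {
            int[] columns = new int[fields.size()];
            for (int i = 0; i < columns.length; i++) {
                columns[i] = fields.get(i).index;
            }
            simple.pushProjection(columns);
        };
    }
}
```

The trade-off in this sketch is that a loader implementing only the simple form loses the
nested-projection information, so whatever calls it would still have to apply that part itself.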



> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> ---------------------------------------------------------------
>
>                 Key: PIG-966
>                 URL: https://issues.apache.org/jira/browse/PIG-966
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces significantly.  See http://wiki.apache.org/pig/LoadStoreRedesignProposal for full details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

