hadoop-pig-dev mailing list archives

From "Benjamin Reed (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-55) Allow user control over split creation
Date Wed, 23 Jan 2008 22:56:35 GMT

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561856#action_12561856 ]

Benjamin Reed commented on PIG-55:
----------------------------------

Sorry to take so long to comment. I was hoping to take a swipe at this myself, but I haven't
been able to find the time.

We cannot expose Hadoop classes in Pig. There are other backends that Pig runs on and we don't
want to pull all of Hadoop with us.

Antonio has a generalized file access layer (PIG-32) that we should integrate with. PigSplit
is an internal class specific to Hadoop, so we shouldn't expose that.

At a higher level, there is something else I would like to be able to do as well: multi-file
splits. The assumption that a split never spans more than one file is problematic when files
are small. It seems like we should be more flexible in that area. We also need fileless splits
for load functions that generate tuples "from thin air".
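
To make the multi-file and fileless cases concrete, here is a minimal sketch in Java. It is
not Pig's API and not the attached patch: SplitSketch, Segment, Split, pack and fileless are
hypothetical names made up for illustration. Paths are plain strings so that no Hadoop type
appears in the signatures. A split holds zero segments (fileless), one, or several
(multi-file).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch only: none of these types exist in Pig or Hadoop.
public class SplitSketch {

    /** A byte range of one file, identified by a plain string path. */
    public static final class Segment {
        public final String path;
        public final long offset;
        public final long length;

        public Segment(String path, long offset, long length) {
            this.path = path;
            this.offset = offset;
            this.length = length;
        }
    }

    /** One unit of work: no segments (fileless), one segment, or many (multi-file). */
    public static final class Split {
        public final List<Segment> segments;

        public Split(List<Segment> segments) {
            this.segments = Collections.unmodifiableList(new ArrayList<>(segments));
        }

        public long totalBytes() {
            long sum = 0;
            for (Segment s : segments) {
                sum += s.length;
            }
            return sum;
        }
    }

    /**
     * Fills each split up to targetBytes, crossing file boundaries when files
     * are small and cutting a file when it does not fit in the space left.
     */
    public static List<Split> pack(List<Segment> files, long targetBytes) {
        if (targetBytes <= 0) {
            throw new IllegalArgumentException("targetBytes must be positive");
        }
        List<Split> out = new ArrayList<>();
        List<Segment> current = new ArrayList<>();
        long used = 0;
        for (Segment f : files) {
            long offset = f.offset;
            long remaining = f.length;
            while (remaining > 0) {
                long take = Math.min(remaining, targetBytes - used);
                current.add(new Segment(f.path, offset, take));
                offset += take;
                remaining -= take;
                used += take;
                if (used == targetBytes) {
                    // The current split is full: emit it and start a new one.
                    out.add(new Split(current));
                    current.clear();
                    used = 0;
                }
            }
        }
        if (!current.isEmpty()) {
            out.add(new Split(current));
        }
        return out;
    }

    /** A split with no backing file, for loaders that generate their own tuples. */
    public static Split fileless() {
        return new Split(Collections.<Segment>emptyList());
    }
}
```

With a 64-byte target, files of 30, 30 and 100 bytes pack into three splits, the first of
which covers parts of all three files. A real version would presumably also want to respect
record boundaries and data locality, which this sketch ignores.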

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from
> pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access
> to a single input stream at a time.  To handle this usage, I've broken the existing split
> creation code out into a few classes and interfaces, and allowed user specified load functions
> to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

