hadoop-pig-dev mailing list archives

From "Charlie Groves (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-55) Allow user control over split creation
Date Tue, 15 Jan 2008 19:05:34 GMT

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559177#action_12559177 ]

Charlie Groves commented on PIG-55:
-----------------------------------

The openDFS removal was accidental.  I should be able to add it back by adding something to
get at the JobConf from PigSplitWrapper.

I don't like that split returns a RecordReader where the first field is unused either, but
the dependency on Hadoop is locked in deeper than that.  PigSplitFactory needs to take a JobConf
so its subclasses can get at the right HDFS to look up the files in the location it needs to split.
PigSplit itself extends InputSplit, another Hadoop class, so if we're removing any references
to Hadoop, we'd need to make an interface like InputSplit that exposes getLength and getLocations,
since those things can't be figured out externally from the split.  We'd also need to have
some concept like Writable so the split can be sent over the wire.  The RecordReader interface
returned by the split has the same problem:  getPos, close, and getProgress need to be handled
by user code and can't be inferred by Pig.  I feel like the complexity added to make interfaces
that are really similar to Hadoop's is worse than the loss of generality from using Hadoop's
interfaces, especially when the outermost layer of code, PigSplitFactory, is going to need access
to one Hadoop class no matter what.
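
To make the trade-off concrete, here is a rough sketch of what Hadoop-free
replacements might look like.  Everything in it is hypothetical and not taken
from either attached patch: GenericSplit stands in for InputSplit plus Writable,
GenericRecordReader stands in for the parts of RecordReader that user code has
to supply, and FileRangeSplit is a toy implementation added only to show the
wire round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for Hadoop's InputSplit combined with Writable.
interface GenericSplit {
    long getLength() throws IOException;          // only the split knows its size
    String[] getLocations() throws IOException;   // only the split knows its hosts
    void write(DataOutput out) throws IOException;       // serialize for the wire
    void readFields(DataInput in) throws IOException;    // deserialize on arrival
}

// Hypothetical stand-in for the user-supplied parts of RecordReader.
interface GenericRecordReader<T> extends Closeable {
    T next() throws IOException;            // null at end of input (an assumption)
    long getPos() throws IOException;
    float getProgress() throws IOException;
}

// Toy split over a byte range of one file, used only to exercise the interface.
final class FileRangeSplit implements GenericSplit {
    private String path = "";
    private long start;
    private long length;
    private String[] hosts = new String[0];

    FileRangeSplit() {}

    FileRangeSplit(String path, long start, long length, String[] hosts) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.hosts = hosts.clone();
    }

    @Override public long getLength() { return length; }
    @Override public String[] getLocations() { return hosts.clone(); }
    String getPath() { return path; }
    long getStart() { return start; }

    @Override public void write(DataOutput out) throws IOException {
        out.writeUTF(path);
        out.writeLong(start);
        out.writeLong(length);
        out.writeInt(hosts.length);
        for (String host : hosts) {
            out.writeUTF(host);
        }
    }

    @Override public void readFields(DataInput in) throws IOException {
        path = in.readUTF();
        start = in.readLong();
        length = in.readLong();
        hosts = new String[in.readInt()];
        for (int i = 0; i < hosts.length; i++) {
            hosts[i] = in.readUTF();
        }
    }

    // Serializes any split to bytes, as a framework would before shipping it.
    static byte[] toBytes(GenericSplit split) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            split.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuilds a FileRangeSplit from bytes produced by toBytes.
    static FileRangeSplit fromBytes(byte[] bytes) {
        FileRangeSplit split = new FileRangeSplit();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            split.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return split;
    }
}
```

Even in this reduced form the two interfaces mirror Hadoop's method for method,
which is the duplication I'm worried about.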

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from
> pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access
> to a single input stream at a time.  To handle this usage, I've broken the existing split
> creation code out into a few classes and interfaces, and allowed user-specified load functions
> to be used in place of the existing code.
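
As an illustration of why a single input stream is not enough for this layout,
the following sketch assembles rows by reading several column streams in
lockstep.  It is not code from the patches: ColumnZipReader and its methods are
hypothetical names, and it assumes each column file holds one value per line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: one open stream per column, one row per call.
final class ColumnZipReader implements AutoCloseable {
    private final List<BufferedReader> columns;

    ColumnZipReader(List<BufferedReader> columns) {
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("at least one column is required");
        }
        this.columns = new ArrayList<>(columns);
    }

    // Returns the next row, or null once every column is exhausted.
    // Fails if some columns end before others.
    List<String> nextRow() throws IOException {
        List<String> row = new ArrayList<>(columns.size());
        int exhausted = 0;
        for (BufferedReader column : columns) {
            String value = column.readLine();
            if (value == null) {
                exhausted++;
            }
            row.add(value);
        }
        if (exhausted == 0) {
            return row;
        }
        if (exhausted == columns.size()) {
            return null;
        }
        throw new IOException("column files have different lengths");
    }

    @Override public void close() throws IOException {
        for (BufferedReader column : columns) {
            column.close();
        }
    }

    // Convenience for in-memory data: each argument is the text of one column.
    static List<List<String>> zip(String... columnTexts) {
        List<BufferedReader> readers = new ArrayList<>();
        for (String text : columnTexts) {
            readers.add(new BufferedReader(new StringReader(text)));
        }
        List<List<String>> rows = new ArrayList<>();
        try (ColumnZipReader reader = new ColumnZipReader(readers)) {
            for (List<String> row = reader.nextRow(); row != null; row = reader.nextRow()) {
                rows.add(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```

A loader restricted to one stream cannot do this, because every call to
nextRow needs all of the column streams open at once.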

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

