hadoop-pig-dev mailing list archives

From "Benjamin Reed (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-55) Allow user control over split creation
Date Mon, 10 Mar 2008 19:00:47 GMT

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577123#action_12577123 ]

Benjamin Reed commented on PIG-55:
----------------------------------

Great work Charlie! I like it. Couple of details:

1) PigContext isn't part of the public API, so it would probably be best if the LoadFunc were
passed to the Chunk rather than relying on the Chunk class to construct it if needed.

2) It would be nice if the compression handling could be done outside the Chunk class so that
programmers don't have to write boilerplate for it. (I'm not sure there is a nice way to do it,
so I'm fine with blowing this off for now.)

3) Javadoc is needed for the Chunk and Chunker classes. The interaction between the LoadFunc
and the Chunk/Chunker classes needs to be well documented.

4) You should put in a test case for a user-defined Chunker and Chunk class. (When InputSplits
were first put into Hadoop, they worked for the built-in classes but failed for user-defined
Splits.)
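
A minimal sketch of point 1, assuming simplified stand-in types: SimpleLoadFunc and
InjectedChunk are hypothetical names invented for illustration, not the Pig LoadFunc or
the Chunk class from the patch. The idea is only that the chunk is handed its loader
through the constructor, so it never needs PigContext to build one.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ChunkSketch {
    // Example loader (hypothetical): one record per non-empty line.
    static SimpleLoadFunc lineLoader() {
        return in -> {
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            List<String> records = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    records.add(line);
                }
            }
            return records;
        };
    }

    public static void main(String[] args) {
        byte[] data = "a\nb\n".getBytes(StandardCharsets.UTF_8);
        // The caller supplies the loader; the chunk does not construct it.
        InjectedChunk chunk = new InjectedChunk(lineLoader(), data);
        System.out.println(chunk.read()); // prints [a, b]
    }
}

// Hypothetical, simplified stand-in for a load function; NOT the real Pig LoadFunc API.
interface SimpleLoadFunc {
    List<String> load(InputStream in) throws IOException;
}

// Hypothetical chunk that receives its loader instead of building one from PigContext.
final class InjectedChunk {
    private final SimpleLoadFunc loader;
    private final byte[] data;

    InjectedChunk(SimpleLoadFunc loader, byte[] data) {
        this.loader = loader;
        this.data = data;
    }

    // Reads this chunk's bytes through the injected loader.
    List<String> read() {
        try (InputStream in = new ByteArrayInputStream(data)) {
            return loader.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, a test for a user-defined chunk (point 4) could pass in a stub loader
without touching any non-public class.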

Alan, can you check this out? I'd like to commit this soon. I don't think it should affect
your pipeline work too much.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from
> pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access
> to a single input stream at a time.  To handle this usage, I've broken the existing split
> creation code out into a few classes and interfaces, and allowed user specified load functions
> to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

