hadoop-pig-dev mailing list archives

From "Charlie Groves (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-55) Allow user control over split creation
Date Wed, 09 Jan 2008 23:40:33 GMT

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557479#action_12557479 ]

Charlie Groves commented on PIG-55:

Ahh, I left it under impl because it's something that only matters to the mapreduce portions
of pig, but I guess all user-implementable interfaces are going to be in the top-level pig
package.

I can see what you're saying about PigSplitFactory getting too much information, but my next
step after this was going to be to figure out how to expose the actual fields used on the
loaded values to the split factory.  Since my data is broken out by columns, if I know the
accessed fields, I can load only the data necessary for those fields, which will be a huge
speedup.  I was thinking I could extract that data from groupbySpec and evalSpec.  Is there
a better way to do this?
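
For illustration only, here is a minimal sketch of the column-pruning idea: given the set of
accessed field indexes, pick just the column files that back them. How that set would be
derived from groupbySpec and evalSpec is the open question above and is not shown. The class
name ColumnProjection, the method filesToLoad, and the one-file-per-field-index layout are all
assumptions, not part of the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Pig or of the attached patch.
class ColumnProjection {
    // Assumes columnFiles.get(i) is the file holding field i.
    // Returns only the files for the accessed fields, in field order.
    static List<String> filesToLoad(List<String> columnFiles, Set<Integer> accessedFields) {
        List<String> needed = new ArrayList<>();
        // TreeSet sorts the indexes so the result order is deterministic.
        for (int field : new TreeSet<>(accessedFields)) {
            // Ignore indexes that do not correspond to a stored column.
            if (field >= 0 && field < columnFiles.size()) {
                needed.add(columnFiles.get(field));
            }
        }
        return needed;
    }
}
```

With four column files and only fields 1 and 3 accessed, the split factory would open two
files instead of four.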

Regardless of that, the JobConf can be accessed from the PigContext, and the index value doesn't
bear any relevance outside of pig's internals, so I can drop those parameters.  I can also
remove the getEvalSpec, getGroupbySpec and getIndex methods from the PigSplit interface and
handle that internally without encumbering user-created splits.  However, the PigSplit interface
can't go away altogether because the PigSplitFactory has to be able to return the actual splits
so they can handle the getLength and getLocations methods appropriately for the HDFS files
they're loading, and so they can create the actual RecordReader with makeReader.  Since
that's particular to the style of loading the split factory is implementing, there's no way
to do it generically from pig.
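
As a rough sketch of what the slimmed-down interfaces might look like, assuming nothing about
the real signatures in the patch: only the method names getLength, getLocations and makeReader
come from the discussion above. Everything prefixed with Sketch, the createSplits method, the
ColumnZip classes, and the in-memory lists standing in for HDFS column files are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Hadoop's RecordReader; rows are plain strings here.
interface SketchRecordReader {
    boolean hasNext();
    String next();
}

// What a user-written split has to answer once the spec and index getters are gone.
interface SketchPigSplit {
    long getLength();                 // size of the data covered by this split
    String[] getLocations();          // hosts where that data lives
    SketchRecordReader makeReader();  // reader specific to this loading style
}

// The factory decides how the input is carved into splits.
interface SketchPigSplitFactory {
    List<SketchPigSplit> createSplits(List<List<String>> columns, String host);
}

// One split spanning every column; rows are rebuilt by zipping the columns together.
class ColumnZipSplit implements SketchPigSplit {
    private final List<List<String>> columns;
    private final String host;

    ColumnZipSplit(List<List<String>> columns, String host) {
        this.columns = columns;
        this.host = host;
    }

    public long getLength() {
        // Character count is used as a stand-in for the byte length of the files.
        long total = 0;
        for (List<String> column : columns) {
            for (String value : column) {
                total += value.length();
            }
        }
        return total;
    }

    public String[] getLocations() {
        return new String[] { host };
    }

    public SketchRecordReader makeReader() {
        return new SketchRecordReader() {
            private int row = 0;

            public boolean hasNext() {
                return !columns.isEmpty() && row < columns.get(0).size();
            }

            public String next() {
                // Join the value at the current row from each column with tabs.
                StringBuilder sb = new StringBuilder();
                for (int c = 0; c < columns.size(); c++) {
                    if (c > 0) sb.append('\t');
                    sb.append(columns.get(c).get(row));
                }
                row++;
                return sb.toString();
            }
        };
    }
}

class ColumnZipSplitFactory implements SketchPigSplitFactory {
    public List<SketchPigSplit> createSplits(List<List<String>> columns, String host) {
        List<SketchPigSplit> splits = new ArrayList<>();
        splits.add(new ColumnZipSplit(columns, host));
        return splits;
    }
}
```

The point of the sketch is that the reader has to see several inputs at once, which is why the
split itself, rather than pig, builds it.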

Another patch forthcoming along these lines.

> Allow user control over split creation
> --------------------------------------
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from
> pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access
> to a single input stream at a time.  To handle this usage, I've broken the existing split
> creation code out into a few classes and interfaces, and allowed user specified load functions
> to be used in place of the existing code.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
