pig-dev mailing list archives

From "Charlie Groves (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (PIG-55) Allow user control over split creation
Date Sun, 02 Mar 2008 23:51:50 GMT

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574311#action_12574311 ]

groves edited comment on PIG-55 at 3/2/08 3:50 PM:

This updates my previous patch to use the generic DataStorage classes instead of Hadoop-specific
code, which fixes issues a and b that you raised.  Each backend will still have to create
a Chunker and hook Chunks into its processing setup, as PigInputFormat and ChunkWrapper
in the patch do for Hadoop, but implementations of Chunk and Chunker themselves should be
backend-agnostic.  As a bonus, PigChunker in the patch implements the default file selection
and LoadFunc processing that used to be in PigInputFormat, and any backend should be able
to instantiate it and pick up normal Pig processing for free.
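
As a rough illustration of the separation described above, here is a minimal sketch.  The
interface shapes, method names, and the BackendSplit and FilePerChunkChunker classes are
assumptions made up for this example; they are not taken from the patch, whose real Chunk
and Chunker signatures may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shapes only: the real Chunk and Chunker in the patch may differ.
interface Chunk {
    /** Paths this unit of work reads; it knows nothing about any backend. */
    List<String> paths();
}

interface Chunker {
    /** Decide which inputs to read and how to group them into chunks. */
    List<Chunk> chunk(List<String> inputPaths);
}

/** Hypothetical chunker: one chunk per input file. */
final class FilePerChunkChunker implements Chunker {
    @Override
    public List<Chunk> chunk(List<String> inputPaths) {
        List<Chunk> chunks = new ArrayList<>();
        for (String path : inputPaths) {
            chunks.add(() -> List.of(path));
        }
        return chunks;
    }
}

/** Hypothetical backend adapter, in the role ChunkWrapper plays for Hadoop. */
final class BackendSplit {
    private final Chunk chunk;

    BackendSplit(Chunk chunk) {
        this.chunk = chunk;
    }

    Chunk chunk() {
        return chunk;
    }
}

final class ChunkerSketch {
    /** The only backend-specific step: wrap each chunk for the backend. */
    static List<BackendSplit> plan(Chunker chunker, List<String> inputs) {
        List<BackendSplit> splits = new ArrayList<>();
        for (Chunk c : chunker.chunk(inputs)) {
            splits.add(new BackendSplit(c));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> inputs = List.of("/data/a", "/data/b");
        for (BackendSplit s : plan(new FilePerChunkChunker(), inputs)) {
            System.out.println(s.chunk().paths());
        }
    }
}
```

In this sketch only BackendSplit and plan would change per backend; the Chunker and the
Chunks it produces stay the same.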

I think c and d are outside the scope of this patch.  Both of those problems relate to
sharing code to process the actual bytes from a Chunk, and that can be built on top of this
change.  This patch only exposes to user code Pig's decision about which files to read and
which code should read them.

I made some minor modifications to the DataStorage code to allow easier access to the properties
on an individual element as its actual type.  It seemed ridiculous to turn longs into strings
only to immediately turn them back into longs all over the place.
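
To illustrate the kind of round trip this avoids, here is a minimal sketch.  ElementStats,
its methods, and the "length" key are hypothetical stand-ins invented for this example and
are not the DataStorage API or the change made in the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a storage element; not the real DataStorage API.
final class ElementStats {
    private final Map<String, Object> values = new HashMap<>();

    void put(String key, Object value) {
        values.put(key, value);
    }

    /** String-based access: callers have to parse the value back themselves. */
    String asString(String key) {
        return String.valueOf(values.get(key));
    }

    /** Typed access: returns the value as its actual type, or fails loudly. */
    <T> T get(String key, Class<T> type) {
        Object v = values.get(key);
        if (v == null) {
            throw new IllegalArgumentException("no such property: " + key);
        }
        return type.cast(v);
    }
}

final class TypedPropertySketch {
    public static void main(String[] args) {
        ElementStats stats = new ElementStats();
        stats.put("length", 1024L);

        // Round trip through a string, then back to a long.
        long viaString = Long.parseLong(stats.asString("length"));

        // Direct access as the actual type.
        long typed = stats.get("length", Long.class);

        System.out.println(viaString == typed);
    }
}
```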

The patch passes all of the tests for me.  It's awesome to go away for a month, svn update,
and have the tests take a fifth of the time to run that they used to.

> Allow user control over split creation
> --------------------------------------
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: pig_chunker_split.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from
> Pig.  This means I can't use LoadFunc to get at the data, as it only allows the loader access
> to a single input stream at a time.  To handle this usage, I've broken the existing split
> creation code out into a few classes and interfaces, and allowed user-specified load functions
> to be used in place of the existing code.
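
One way to picture the file-per-column case is a chunker that groups column files by their
parent directory, so a single unit of work can open every column at once.  This is a sketch
under assumptions: the ColumnChunk class, the directory layout, and the ".col" file names
are invented for illustration and do not come from the attached patches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical unit of work: every column file for one table directory. */
final class ColumnChunk {
    final String tableDir;
    final List<String> columnFiles;

    ColumnChunk(String tableDir, List<String> columnFiles) {
        this.tableDir = tableDir;
        this.columnFiles = columnFiles;
    }
}

final class ColumnChunkerSketch {
    /** Group paths by parent directory so each chunk carries all of its columns. */
    static List<ColumnChunk> chunk(List<String> paths) {
        // TreeMap keeps the directories in a stable, sorted order.
        Map<String, List<String>> byDir = new TreeMap<>();
        for (String p : paths) {
            int slash = p.lastIndexOf('/');
            String dir = slash < 0 ? "" : p.substring(0, slash);
            byDir.computeIfAbsent(dir, k -> new ArrayList<>()).add(p);
        }
        List<ColumnChunk> chunks = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : byDir.entrySet()) {
            chunks.add(new ColumnChunk(e.getKey(), e.getValue()));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> paths = List.of("/data/t1/a.col", "/data/t1/b.col", "/data/t2/a.col");
        for (ColumnChunk c : chunk(paths)) {
            System.out.println(c.tableDir + " -> " + c.columnFiles);
        }
    }
}
```

A loader handed one of these chunks could open all of its column files together, which is
what a single input stream per loader does not allow.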

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
