hadoop-pig-dev mailing list archives

From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1518) multi file input format for loaders
Date Fri, 27 Aug 2010 07:56:55 GMT

    [ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903283#action_12903283 ]

Ashutosh Chauhan commented on PIG-1518:

Sorry for being late on this, now that it's committed. But I think you have gotten it the other
way around: a CollectableLoadFunc is combinable, but an OrderedLoadFunc is not. Let's go over all
three interfaces:

* h4. CollectableLoadFunc: A loader implementing it must make sure that all instances of a
particular key are present in one split. If you combine splits of such a loader, it will still
remain a CollectableLoadFunc, because all instances of a key will still be in the same split after
combination. It dictates a property *within* a split. Thus, it's combinable.
* h4. OrderedLoadFunc: OrderedLoadFunc insists that a loader implementing it must read splits
in a well-defined order. If you combine the splits, that order may not hold. You can't combine
splits for this loader. It defines a property *across* multiple splits.
* h4. IndexableLoadFunc: Says that the loader is indexable, meaning that given a key it will get you
as close as possible to that key. It inherently assumes the data is sorted and an index is built
for it. Your combined splits may not remain sorted anymore. You can't combine splits for this
interface either. It defines a property *across* multiple splits.
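
To make the within-split versus across-split distinction concrete, here is a minimal sketch on toy data. It does not use any Pig or Hadoop classes: a split is just a list of keys, and SplitCombineDemo, combine() and the grouping {0,2},{1,3} are hypothetical names and values chosen for illustration, assuming a combiner that pairs splits by size or locality rather than by position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: a "split" is a list of keys. Nothing here is Pig's API.
class SplitCombineDemo {

    // Collectable property: every occurrence of a key lives in exactly one split.
    static boolean keysConfinedToOneSplit(List<List<String>> splits) {
        Map<String, Integer> owner = new HashMap<>();
        for (int i = 0; i < splits.size(); i++) {
            for (String key : splits.get(i)) {
                Integer previous = owner.putIfAbsent(key, i);
                if (previous != null && previous != i) {
                    return false; // same key seen in two different splits
                }
            }
        }
        return true;
    }

    // Ordered property: reading the splits one after another yields sorted keys.
    static boolean sortedAcrossSplits(List<List<String>> splits) {
        String last = null;
        for (List<String> split : splits) {
            for (String key : split) {
                if (last != null && last.compareTo(key) > 0) {
                    return false;
                }
                last = key;
            }
        }
        return true;
    }

    // Hypothetical combiner: each group lists the original split indexes to merge.
    static List<List<String>> combine(List<List<String>> splits, int[][] groups) {
        List<List<String>> combined = new ArrayList<>();
        for (int[] group : groups) {
            List<String> merged = new ArrayList<>();
            for (int index : group) {
                merged.addAll(splits.get(index));
            }
            combined.add(merged);
        }
        return combined;
    }

    // Four sorted splits, each holding all instances of one key.
    static List<List<String>> sampleSplits() {
        return Arrays.asList(
                Arrays.asList("a", "a"),
                Arrays.asList("b", "b"),
                Arrays.asList("c", "c"),
                Arrays.asList("d", "d"));
    }

    public static void main(String[] args) {
        List<List<String>> original = sampleSplits();
        // Merge split 0 with 2 and split 1 with 3 (non-adjacent pairs).
        List<List<String>> combined = combine(original, new int[][] {{0, 2}, {1, 3}});

        System.out.println("collectable before: " + keysConfinedToOneSplit(original));
        System.out.println("ordered before:     " + sortedAcrossSplits(original));
        System.out.println("collectable after:  " + keysConfinedToOneSplit(combined));
        System.out.println("ordered after:      " + sortedAcrossSplits(combined));
    }
}
```

On this data the collectable check holds both before and after combining, while the ordering check holds only before: the combined splits read back as a, a, c, c, b, b, d, d.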

If you agree with the above, then PigStorage isn't combinable, because it inherits OrderedLoadFunc
through FileInputLoadFunc:
public class PigStorage extends FileInputLoadFunc implements StoreFuncInterface, LoadPushDown {}
public abstract class FileInputLoadFunc extends LoadFunc implements OrderedLoadFunc {}
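
The inheritance argument can be checked mechanically. The sketch below uses empty stand-in types that only mirror the class hierarchy quoted above (StoreFuncInterface and LoadPushDown are left out); they are not the real Pig classes. CombinableCheck, isCombinable() and HypotheticalCollectableLoader are made-up names, and the rule encoded is the one argued in this comment, not necessarily what the committed patch does.

```java
// Stand-in types that copy only the hierarchy quoted above; not Pig's real classes.
class CombinableCheck {

    interface CollectableLoadFunc {}
    interface OrderedLoadFunc {}
    interface IndexableLoadFunc {}

    abstract static class LoadFunc {}

    abstract static class FileInputLoadFunc extends LoadFunc implements OrderedLoadFunc {}

    static class PigStorage extends FileInputLoadFunc {}

    // Hypothetical loader that only promises the within-split property.
    static class HypotheticalCollectableLoader extends LoadFunc implements CollectableLoadFunc {}

    // Rule proposed in this comment: a loader is combinable unless it promises
    // a property that spans multiple splits (ordering or indexing).
    static boolean isCombinable(Class<? extends LoadFunc> loader) {
        return !OrderedLoadFunc.class.isAssignableFrom(loader)
                && !IndexableLoadFunc.class.isAssignableFrom(loader);
    }

    public static void main(String[] args) {
        System.out.println("PigStorage combinable: "
                + isCombinable(PigStorage.class));
        System.out.println("HypotheticalCollectableLoader combinable: "
                + isCombinable(HypotheticalCollectableLoader.class));
    }
}
```

Under this rule the PigStorage stand-in is reported as not combinable, purely because its abstract parent implements OrderedLoadFunc, while the collectable-only loader is reported as combinable.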

I also didn't get your logic for *CollectableLoadFunc AND an OrderedLoadFunc*. It would help if
you could explain that a bit.

> multi file input format for loaders
> -----------------------------------
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>         Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch,
PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
> We frequently run into the situation where Pig needs to deal with small files in the input.
In this case a separate map is created for each file, which could be very inefficient.
> It would be great to have an umbrella input format that can take multiple files and
use them in a single split. We would like to see this working with different data formats
if possible.
> There are already a couple of input formats doing a similar thing: MultifileInputFormat
as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
