pig-dev mailing list archives

From "Yan Zhou (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1518) multi file input format for loaders
Date Sat, 14 Aug 2010 01:06:18 GMT

    [ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898490#action_12898490 ]

Yan Zhou commented on PIG-1518:

There is a bigger question at hand. The semantics of OrderedLoadFunc are that the splits are
totally ordered, and BinStorage, InterStorage and PigStorage all implement that interface
through FileInputLoadFunc. Combining splits as conceived here will definitely
destroy the split ordering, so if combination is disabled for these storages, the feature
would be virtually useless for the majority of use cases.
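
To make the ordering concern concrete, here is a minimal sketch. It assumes a locality-based combination policy (all splits on the same host are merged), which is only one way the combination could work. Split, combineByHost and totallyOrdered are hypothetical names for illustration, not Pig or Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; none of these are Pig or Hadoop classes.
class SplitCombineSketch {

    /** A small input split: its rank in the original total order and the host storing it. */
    static final class Split {
        final int position;
        final String host;

        Split(int position, String host) {
            this.position = position;
            this.host = host;
        }
    }

    /** Assumed policy: merge all splits stored on the same host into one combined split. */
    static List<List<Split>> combineByHost(List<Split> splits) {
        Map<String, List<Split>> byHost = new LinkedHashMap<>();
        for (Split s : splits) {
            byHost.computeIfAbsent(s.host, h -> new ArrayList<>()).add(s);
        }
        return new ArrayList<>(byHost.values());
    }

    /** True only if the combined splits cover pairwise disjoint position ranges. */
    static boolean totallyOrdered(List<List<Split>> combined) {
        List<int[]> ranges = new ArrayList<>();
        for (List<Split> group : combined) {
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
            for (Split s : group) {
                min = Math.min(min, s.position);
                max = Math.max(max, s.position);
            }
            ranges.add(new int[] {min, max});
        }
        // Sort by range start, then look for a range that begins before the previous one ends.
        ranges.sort((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 1; i < ranges.size(); i++) {
            if (ranges.get(i)[0] <= ranges.get(i - 1)[1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Split> splits = new ArrayList<>();
        splits.add(new Split(0, "hostA"));
        splits.add(new Split(1, "hostB"));
        splits.add(new Split(2, "hostA"));
        splits.add(new Split(3, "hostB"));
        // hostA holds positions {0, 2} and hostB holds {1, 3}: the ranges interleave.
        System.out.println(totallyOrdered(combineByHost(splits))); // false
    }
}
```

Once the combined splits interleave like this, no comparison between them can say that one wholly precedes the other, which is what a consumer of a total order would rely on.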

On the other hand, I'm seeing no use of the comparison capability except in MergeJoinIndexer's
getNext() method, which makes me wonder whether OrderedLoadFunc can be removed from FileInputLoadFunc.
Semantically, FileInputLoadFunc should not support the ordering of splits, as Hadoop's FileInputFormat
doesn't. When a need arises, as in MergeJoinIndexer, we can add that extension where it is needed. But the
change may incur some backward compatibility issues.
I'm now soliciting comments in this area.
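
As a rough illustration of the alternative, the sketch below makes ordering an opt-in capability instead of something every file-based loader inherits. All names here (SplitOrdering, BaseFileLoader, SortedFileLoader) are hypothetical and simplified. They are not the actual Pig interfaces, and the zero-padded string key is just an assumed encoding.

```java
// Hypothetical, simplified stand-ins; these are not the real Pig classes.
class OptInOrderingSketch {

    /** Opt-in capability, loosely analogous in spirit to OrderedLoadFunc. */
    interface SplitOrdering {
        /** Returns a key whose natural String order matches the order of the splits. */
        String splitKey(String path, long offset);
    }

    /** Plain file loader: makes no promise about split order, so its splits may be combined. */
    static class BaseFileLoader {
    }

    /** Loader that explicitly opts in to ordered splits, e.g. for a merge join. */
    static class SortedFileLoader extends BaseFileLoader implements SplitOrdering {
        @Override
        public String splitKey(String path, long offset) {
            // Zero-pad the (non-negative) offset so lexicographic order equals numeric order.
            return String.format("%s:%020d", path, offset);
        }
    }

    /** A consumer that needs ordering checks for the capability explicitly. */
    static boolean usableForMergeJoin(BaseFileLoader loader) {
        return loader instanceof SplitOrdering;
    }

    /** Split combination is only safe when the loader does not promise an order. */
    static boolean splitsMayBeCombined(BaseFileLoader loader) {
        return !(loader instanceof SplitOrdering);
    }

    public static void main(String[] args) {
        System.out.println(splitsMayBeCombined(new BaseFileLoader()));   // true
        System.out.println(splitsMayBeCombined(new SortedFileLoader())); // false
        System.out.println(usableForMergeJoin(new SortedFileLoader()));  // true
    }
}
```

The compatibility risk shows up here as well: any existing caller that assumes every file-based loader is ordered would need the explicit capability check.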

> multi file input format for loaders
> -----------------------------------
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
> We frequently run into the situation where Pig needs to deal with small files in the input.
> In this case a separate map is created for each file, which could be very inefficient.
> It would be great to have an umbrella input format that can take multiple files and
> use them in a single split. We would like to see this working with different data formats
> if possible.
> There are already a couple of input formats doing a similar thing: MultifileInputFormat
> as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.

