pig-dev mailing list archives

From "Yan Zhou (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1518) multi file input format for loaders
Date Fri, 27 Aug 2010 14:18:54 GMT

    [ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903423#action_12903423 ]

Yan Zhou commented on PIG-1518:
-------------------------------

Neither MergeJoinIndexer nor IndexableLoadFunc is combinable.

Regarding OrderedLoadFunc, the story is a bit more complex. First of all, its only non-overridden
method, getSplitComparable, is used only in MergeJoinIndexer, which is already not combinable.


The big issue is FileInputLoadFunc, which is extended by BinStorage, PigStorage and InterStorage.
Semantically, I agree that an OrderedLoadFunc should not be combinable. However, FileInputLoadFunc's
implementation of OrderedLoadFunc makes little sense in that its ordering is based on the
(path, offset) pair. This is an ordering, but an arbitrary one. Mathematically, one
can establish an arbitrary ordering over any discrete set of data; the point is how
the ordering is used. For our purposes, the ordering should be related to the keys used in data
manipulation, and (path, offset) does not serve that purpose. Otherwise, a FileInputLoadFunc
implicitly still requires that the storage hand out splits in some key order. If that storage ordering
does not actually exist, FileInputLoadFunc gains nothing from the "sortedness" it claims as an
OrderedLoadFunc, because the ordering is just, well, arbitrary. The three extensions of
FileInputLoadFunc work on generic data storage. Unless they work on sorted data in general, they
should not be OrderedLoadFuncs.
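
To illustrate the point about (path, offset), here is a minimal Java sketch. It is
not Pig code: PathOffsetKey, its firstRecordKey field and OrderingDemo are hypothetical
names made up for this example, and the only assumption is that a split is compared by
path first and by offset second. It shows that such an ordering is perfectly valid and
still says nothing about the order of the record keys inside the splits.

```java
// Hypothetical stand-in for a split descriptor; this is NOT Pig's actual class.
final class PathOffsetKey implements Comparable<PathOffsetKey> {
    final String path;
    final long offset;
    // Hypothetical: key of the first record in the split, kept only for illustration.
    final int firstRecordKey;

    PathOffsetKey(String path, long offset, int firstRecordKey) {
        this.path = path;
        this.offset = offset;
        this.firstRecordKey = firstRecordKey;
    }

    // Orders splits by path, then by offset. Record keys play no part in it.
    @Override
    public int compareTo(PathOffsetKey other) {
        int byPath = path.compareTo(other.path);
        return byPath != 0 ? byPath : Long.compare(offset, other.offset);
    }
}

final class OrderingDemo {
    // Returns true only if sorting by (path, offset) also leaves the
    // record keys in non-decreasing order.
    static boolean keysAscendInSplitOrder(PathOffsetKey[] splits) {
        PathOffsetKey[] sorted = splits.clone();
        java.util.Arrays.sort(sorted);
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i - 1].firstRecordKey > sorted[i].firstRecordKey) {
                return false;
            }
        }
        return true;
    }
}
```

For generic storage the check fails as soon as a later split holds a smaller key; it
passes only when the data happens to be written in key order, which is the implicit
requirement described above.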

The other use of OrderedLoadFunc, as opposed to its non-overridden method getSplitComparable, is by
map-side cogroup. But map-side cogroup does not check whether the sort key is the join key, which
is critical for correctness. It also requires the loader to be a CollectableLoadFunc to work properly.

Since we do not want to break backward compatibility, and the only use of OrderedLoadFunc in
Pig, except for MergeJoinIndexer which is already excluded from combining, is in map-side cogroup
with CollectableLoadFunc, I mark "a CollectableLoadFunc AND an OrderedLoadFunc" as non-combinable.
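
The rule can be sketched in Java as follows. This is only an illustration and not the
code in the patch: the three interfaces are empty stand-ins that borrow the names of
Pig's interfaces, CombinabilitySketch and the sample loader classes are hypothetical,
and the separate MergeJoinIndexer exclusion is left out.

```java
// Empty stand-ins named after Pig's interfaces; the real ones declare methods.
interface OrderedLoadFunc {}
interface CollectableLoadFunc {}
interface IndexableLoadFunc {}

// Hypothetical loaders used only to exercise the rule.
final class PlainFileLoader implements OrderedLoadFunc {}
final class CogroupLoader implements OrderedLoadFunc, CollectableLoadFunc {}
final class IndexedLoader implements IndexableLoadFunc {}

final class CombinabilitySketch {
    // Returns false for loaders whose splits must not be combined.
    static boolean isCombinable(Object loader) {
        // An IndexableLoadFunc is never combinable.
        if (loader instanceof IndexableLoadFunc) {
            return false;
        }
        // Map-side cogroup needs both interfaces, so only that pair is excluded.
        if (loader instanceof CollectableLoadFunc && loader instanceof OrderedLoadFunc) {
            return false;
        }
        // An OrderedLoadFunc alone (the FileInputLoadFunc case) stays combinable.
        return true;
    }
}
```

Under this rule a loader that is only an OrderedLoadFunc, such as the FileInputLoadFunc
extensions, keeps its splits combinable, which preserves backward compatibility.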

In the future, we should really clean OrderedLoadFunc out of FileInputLoadFunc and
let the getSplitComparable method provide key-related info rather than the (path, offset) pair.
Backward compatibility may need to be addressed too. Only then will the waters clear,
and I will be fine adjusting the non-combinable setting accordingly.

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch,
PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files in the input.
In this case a separate map is created for each file, which could be very inefficient.
> It would be great to have an umbrella input format that can take multiple files and
use them in a single split. We would like to see this working with different data formats
if possible.
> There are already a couple of input formats doing a similar thing: MultifileInputFormat
as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

