pig-dev mailing list archives

From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1518) multi file input format for loaders
Date Sat, 14 Aug 2010 21:57:16 GMT

    [ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898648#action_12898648

Ashutosh Chauhan commented on PIG-1518:

This feature of combining multiple splits should honor the OrderedLoadFunc interface. If a
loadfunc implements that interface, then the splits it generates should not be combined. However,
it's not clear why FileInputLoadFunc implements this interface. AFAIK, the split[] returned by
getSplits() on FileInputFormat makes no guarantee that the underlying splits will be returned
in order. Ordered return is the default behavior right now, so having FileInputLoadFunc
implement OrderedLoadFunc doesn't cause any problem in the current implementation. But there
seems to be no real benefit in FileInputLoadFunc implementing it (there is one exception, which
I will come to below). So, I will argue that FileInputLoadFunc should stop implementing
OrderedLoadFunc. The immediate benefit is that this change becomes useful to all the fundamental
storage mechanisms of Pig, like PigStorage, BinStorage, InterStorage etc. A class dropping an
interface it implements can be seen as a backward-incompatible change, but I really doubt
anyone cares whether PigStorage reads splits in an ordered fashion.
The only real victim of this change will be MergeJoin, which will stop working with PigStorage
by default. But first, we have not seen MergeJoin used with PigStorage in many places. Second,
it is anyway based on an assumption about FileInputFormat, which may choose to change its
behavior in the future. Third, the solution to this problem is straightforward: have another
loader that extends PigStorage and implements OrderedLoadFunc, which can be used to load data
for merge join.
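
A minimal sketch of the "opt-in ordered loader" idea, using simplified stand-ins rather than
the real Pig classes. Split, OrderedLoader, PlainLoader and OrderedPlainLoader are all
hypothetical names for illustration; the real OrderedLoadFunc and PigStorage have different
signatures. The point is only that the plain loader makes no ordering promise, while a
subclass adds one by implementing the ordering interface.

```java
import java.util.ArrayList;
import java.util.List;

class OrderedLoaderSketch {
    /** Hypothetical stand-in for a file split: a path plus a start offset. */
    static final class Split {
        final String path;
        final long start;

        Split(String path, long start) {
            this.path = path;
            this.start = start;
        }
    }

    /** Hypothetical stand-in for OrderedLoadFunc: defines a total order on splits. */
    interface OrderedLoader {
        int compareSplits(Split a, Split b);
    }

    /** Hypothetical stand-in for a plain loader such as PigStorage: no ordering promise. */
    static class PlainLoader {
    }

    /** Opt-in variant: same loader, plus an explicit ordering by (path, start offset). */
    static class OrderedPlainLoader extends PlainLoader implements OrderedLoader {
        @Override
        public int compareSplits(Split a, Split b) {
            int byPath = a.path.compareTo(b.path);
            return byPath != 0 ? byPath : Long.compare(a.start, b.start);
        }
    }

    /** Returns a sorted copy of the splits, as a merge join would need them. */
    static List<Split> sortSplits(OrderedLoader loader, List<Split> splits) {
        List<Split> sorted = new ArrayList<>(splits);
        sorted.sort(loader::compareSplits);
        return sorted;
    }
}
```

Under these assumptions, a merge join would ask for the ordered variant explicitly, and every
other user of the plain loader would be free to have its splits combined.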

In essence, I am arguing to drop the OrderedLoadFunc interface from FileInputLoadFunc so that
this feature is useful for a large number of use cases.
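
The rule argued for here (combine small splits unless the loader declares itself ordered) can
be sketched as follows. This is an illustration under stated assumptions, not the patch for
this issue: Loader, OrderedLoader, the loader classes and the combine() method are
hypothetical stand-ins, splits are reduced to their sizes in bytes, and the greedy size cap
is just one possible packing policy.

```java
import java.util.ArrayList;
import java.util.List;

class SplitCombinerSketch {
    /** Hypothetical stand-in for any loader. */
    interface Loader {
    }

    /** Hypothetical stand-in for OrderedLoadFunc; only its presence matters here. */
    interface OrderedLoader {
    }

    static class PlainLoader implements Loader {
    }

    static class OrderedPlainLoader extends PlainLoader implements OrderedLoader {
    }

    /**
     * Greedily packs split sizes (in bytes) into combined splits of at most maxBytes,
     * preserving input order. If the loader is ordered, nothing is combined and every
     * split is returned on its own.
     */
    static List<List<Long>> combine(Loader loader, List<Long> splitSizes, long maxBytes) {
        List<List<Long>> result = new ArrayList<>();
        if (loader instanceof OrderedLoader) {
            for (Long size : splitSizes) {
                List<Long> single = new ArrayList<>();
                single.add(size);
                result.add(single);
            }
            return result;
        }
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (Long size : splitSizes) {
            // Close the current group when adding this split would exceed the cap.
            if (!current.isEmpty() && currentBytes + size > maxBytes) {
                result.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }
}
```

With this shape, dropping the ordering interface from the base loader is what lets the common
loaders benefit: only loaders that explicitly opt in to ordering are excluded from combining.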

Yan, you also need to watch out for ReadToEndLoader, which also makes assumptions that
may break in the presence of this feature.

> multi file input format for loaders
> -----------------------------------
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
> We frequently run into the situation where Pig needs to deal with small files in the input.
> In this case a separate map is created for each file, which could be very inefficient.
> It would be great to have an umbrella input format that can take multiple files and
> use them in a single split. We would like to see this working with different data formats
> if possible.
> There are already a couple of input formats doing a similar thing: MultifileInputFormat
> as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
