hadoop-pig-dev mailing list archives

From "Yan Zhou (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1518) multi file input format for loaders
Date Fri, 30 Jul 2010 23:47:16 GMT

    [ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894205#action_12894205 ]

Yan Zhou commented on PIG-1518:
-------------------------------

CombineFileInputFormat, which replaces the deprecated MultiFileInputFormat, batches small files
on the basis of block locality. For PIG, this umbrella input format will have to work with
generic input formats, for which block info is unavailable but data node and size info are
present so that M/R can make scheduling decisions. In other words, PIG cannot break the
original splits apart to "work inside" them; it can only use the original splits as building
blocks for the combined input splits.

Consequently, this combined input format will hold multiple generic input splits, such that
each combined split's size is bounded by a configured limit, say pig.maxsplitsize, whose
default value is the HDFS block size of the file system the load source sits in.
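
As a rough illustration of how such a limit could be resolved, here is a minimal Java sketch. It assumes the property name pig.maxsplitsize suggested above, which is only a proposal at this point. The class and method names are hypothetical, and java.util.Properties stands in for the real job configuration.

```java
import java.util.Properties;

final class MaxSplitSizeSketch {

    // Property name as floated in this comment; not a confirmed Pig setting.
    static final String MAX_SPLIT_SIZE_KEY = "pig.maxsplitsize";

    /**
     * Returns the configured combined-split limit in bytes, falling back to
     * the block size of the load source's file system when nothing is set.
     */
    static long resolveMaxSplitSize(Properties conf, long hdfsBlockSize) {
        String raw = conf.getProperty(MAX_SPLIT_SIZE_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return hdfsBlockSize;
        }
        return Long.parseLong(raw.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        long blockSize = 64L * 1024 * 1024; // illustrative block size only
        System.out.println(resolveMaxSplitSize(conf, blockSize));
        conf.setProperty(MAX_SPLIT_SIZE_KEY, "1048576");
        System.out.println(resolveMaxSplitSize(conf, blockSize));
    }
}
```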

However, because merge join requires its input tables to be sorted, split combination will
not be used for any load that feeds a merge join. For mapside cogroup or mapside group by,
though, the splits can be combined: they are only required to contain all duplicates of a
key within one instance, and combining splits preserves that invariant.

During combination, splits on the same data node will be merged as much as possible.
Leftovers will be merged without regard to data locality. Of all the data nodes in use,
those with fewer splits will be processed before those with more splits, so as to minimize
the leftovers on the data nodes with fewer splits. On each data node, a greedy approach is
adopted: the largest splits are tried first, because smaller splits are easier to merge among
themselves later.
As a result, the implementation will maintain a list of data hosts sorted by number of splits,
each holding a list of its original splits sorted by size, to perform the above operations
efficiently. The complexity should be linear in the number of original splits.
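
The combination strategy above can be sketched roughly as follows. This is an illustration under stated assumptions, not the planned implementation. The Split class is a hypothetical stand-in for a generic input split, each split is assumed to list a given host at most once, and the packing loop is a simple scan that does not attempt the linear complexity mentioned above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class SplitCombinerSketch {

    /** Hypothetical stand-in for an original input split: a length plus host names. */
    static final class Split {
        final String id;
        final long length;
        final String[] hosts;

        Split(String id, long length, String... hosts) {
            this.id = id;
            this.length = length;
            this.hosts = hosts;
        }

        @Override
        public String toString() { return id; }
    }

    private static long total(List<Split> group) {
        long sum = 0;
        for (Split s : group) sum += s.length;
        return sum;
    }

    /** Greedy packing, largest first; empties 'pool'. An oversized split gets its own group. */
    private static List<List<Split>> pack(List<Split> pool, long maxSize) {
        pool.sort(Comparator.comparingLong((Split s) -> s.length).reversed());
        List<List<Split>> groups = new ArrayList<>();
        while (!pool.isEmpty()) {
            List<Split> group = new ArrayList<>();
            long used = 0;
            for (Iterator<Split> it = pool.iterator(); it.hasNext();) {
                Split s = it.next();
                if (group.isEmpty() || used + s.length <= maxSize) {
                    group.add(s);
                    used += s.length;
                    it.remove();
                }
            }
            groups.add(group);
        }
        return groups;
    }

    static List<List<Split>> combine(List<Split> splits, long maxSize) {
        // Group the original splits by every host they report.
        Map<String, List<Split>> byHost = new TreeMap<>();
        for (Split s : splits) {
            for (String h : s.hosts) {
                byHost.computeIfAbsent(h, k -> new ArrayList<>()).add(s);
            }
        }
        // Hosts with fewer splits are handled first.
        List<List<Split>> hosts = new ArrayList<>(byHost.values());
        hosts.sort(Comparator.comparingInt((List<Split> l) -> l.size()));

        Set<Split> used = new HashSet<>();
        List<List<Split>> combined = new ArrayList<>();
        for (List<Split> onHost : hosts) {
            List<Split> pool = new ArrayList<>();
            for (Split s : onHost) if (!used.contains(s)) pool.add(s);
            List<List<Split>> groups = pack(pool, maxSize);
            for (int i = 0; i < groups.size(); i++) {
                List<Split> g = groups.get(i);
                boolean last = (i == groups.size() - 1);
                // The final, under-filled group stays unused and becomes a leftover.
                if (!last || total(g) >= maxSize) {
                    combined.add(g);
                    used.addAll(g);
                }
            }
        }
        // Leftovers are merged without regard to locality.
        List<Split> leftovers = new ArrayList<>();
        for (Split s : splits) if (!used.contains(s)) leftovers.add(s);
        combined.addAll(pack(leftovers, maxSize));
        return combined;
    }

    public static void main(String[] args) {
        List<Split> splits = new ArrayList<>();
        splits.add(new Split("a", 60, "h1"));
        splits.add(new Split("b", 40, "h1"));
        splits.add(new Split("c", 30, "h1"));
        splits.add(new Split("d", 50, "h2"));
        splits.add(new Split("e", 20, "h2"));
        System.out.println(combine(splits, 100));
    }
}
```

With the five sample splits in main and a limit of 100, the sketch produces one node-local combined split (a, b) on h1 and one leftover combined split (d, c, e) that mixes hosts.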

Note that for data locality, we just honor whatever the generic input split's getLocations()
method produces. Any particular input split implementation may or may not actually hold that
property. For instance, CombineFileInputFormat will combine node-local or rack-local blocks
into a split. Essentially, this PIG container input split works on whatever perception of
data locality the underlying loader provides.

On the implementation side, PigSplit will no longer hold a single wrapped InputSplit instance
but a new CombinedInputSplit instance. Accordingly, PigRecordReader will hold a list of wrapped
record readers rather than a single one, and PigRecordReader's nextKeyValue() will use the
wrapped record readers in sequence to fetch the next values.
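
The reader chaining described above might look roughly like this. It is a simplified sketch and not the actual PigRecordReader. The SimpleReader interface, ListReader, and ChainedReader are hypothetical names that only mimic the nextKeyValue()/getCurrentValue() shape, with in-memory lists standing in for the wrapped splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class ChainedReaderSketch {

    /** Hypothetical, simplified reader contract. */
    interface SimpleReader {
        boolean nextKeyValue();
        String getCurrentValue();
    }

    /** Reads from an in-memory list; stands in for one wrapped split's reader. */
    static final class ListReader implements SimpleReader {
        private final Iterator<String> it;
        private String current;

        ListReader(List<String> values) { this.it = values.iterator(); }

        @Override
        public boolean nextKeyValue() {
            if (!it.hasNext()) return false;
            current = it.next();
            return true;
        }

        @Override
        public String getCurrentValue() { return current; }
    }

    /** Delegates to the wrapped readers one after another. */
    static final class ChainedReader implements SimpleReader {
        private final List<SimpleReader> readers;
        private int index = 0;

        ChainedReader(List<SimpleReader> readers) { this.readers = readers; }

        @Override
        public boolean nextKeyValue() {
            // Move on to the next wrapped reader once the current one is exhausted.
            while (index < readers.size()) {
                if (readers.get(index).nextKeyValue()) return true;
                index++;
            }
            return false;
        }

        @Override
        public String getCurrentValue() {
            return index < readers.size() ? readers.get(index).getCurrentValue() : null;
        }
    }

    static List<String> readAll(SimpleReader reader) {
        List<String> out = new ArrayList<>();
        while (reader.nextKeyValue()) out.add(reader.getCurrentValue());
        return out;
    }

    public static void main(String[] args) {
        SimpleReader chained = new ChainedReader(Arrays.<SimpleReader>asList(
                new ListReader(Arrays.asList("a", "b")),
                new ListReader(new ArrayList<String>()),
                new ListReader(Arrays.asList("c"))));
        System.out.println(readAll(chained));
    }
}
```

An empty wrapped reader is simply skipped, so the caller sees one continuous stream of values across all the original splits.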

Risks include: 1) the test verifications may need major changes, since this optimization may
cause major ordering changes in results; 2) since LoadFunc.prepareRead() takes a PigSplit
argument, there might be a backward compatibility issue as PigSplit changes its wrapped input
split to the combined input split. This should be very unlikely, though, as the only known
use of the PigSplit argument is the internal "index loader" for the right table in merge
join.

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>
> We frequently run into the situation where Pig needs to deal with small files in the input.
> In this case a separate map is created for each file, which can be very inefficient.
> It would be great to have an umbrella input format that can take multiple files and
> use them in a single split. We would like to see this working with different data formats
> if possible.
> There are already a couple of input formats doing a similar thing: MultiFileInputFormat
> as well as CombineFileInputFormat; however, neither works with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

