pig-dev mailing list archives

From "Yan Zhou (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1518) multi file input format for loaders
Date Tue, 03 Aug 2010 01:44:17 GMT

    [ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894778#action_12894778 ]

Yan Zhou commented on PIG-1518:
-------------------------------

In contrast with Hive, where CombineFileInputFormat is used to generate input splits over
the underlying storage formats, Pig's combined splits work on top of the splits generated
by the underlying loaders. In other words, Hive's input splits are CombineFileSplits that
create record readers for the underlying storage formats, while Pig's combined input splits
contain the underlying storage's splits.
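
To illustrate the wrapping idea only, here is a minimal, self-contained sketch in plain Java.
It does not use the Hadoop or Pig APIs: LoaderSplit, CombinedSplit, SplitCombiner and the
maxCombinedSize parameter are hypothetical names, and the greedy packing rule is an assumption
made for the example, not necessarily the policy Pig applies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a split produced by an underlying loader.
// This is not Hadoop's InputSplit; only a path and a length are modeled.
final class LoaderSplit {
    final String path;
    final long length;

    LoaderSplit(String path, long length) {
        this.path = path;
        this.length = length;
    }
}

// Hypothetical combined split: it only holds the loader splits it wraps.
final class CombinedSplit {
    final List<LoaderSplit> wrapped = new ArrayList<>();

    long totalLength() {
        long sum = 0;
        for (LoaderSplit s : wrapped) {
            sum += s.length;
        }
        return sum;
    }
}

final class SplitCombiner {
    // Greedy packing (an assumption for this sketch): keep adding loader
    // splits to the current combined split until the next one would push
    // it past maxCombinedSize, then start a new combined split.
    // A single loader split larger than the limit ends up alone.
    static List<CombinedSplit> combine(List<LoaderSplit> splits, long maxCombinedSize) {
        List<CombinedSplit> result = new ArrayList<>();
        CombinedSplit current = new CombinedSplit();
        for (LoaderSplit s : splits) {
            boolean overflow = current.totalLength() + s.length > maxCombinedSize;
            if (!current.wrapped.isEmpty() && overflow) {
                result.add(current);
                current = new CombinedSplit();
            }
            current.wrapped.add(s);
        }
        if (!current.wrapped.isEmpty()) {
            result.add(current);
        }
        return result;
    }
}
```

The point of the sketch is that the loader's own splits are left intact and merely grouped, so
each loader could still create its usual record reader for every wrapped split.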

CombineFileRecordReader would have been reusable if it were not supported only in 0.18 and
if its constructor did not require a CombineFileSplit argument instead of an InputSplit
(MAPREDUCE-955).

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>
> We frequently run into situations where Pig needs to deal with small files in the input.
> In this case a separate map is created for each file, which can be very inefficient.
> It would be great to have an umbrella input format that can take multiple files and
> use them in a single split. We would like to see this working with different data formats
> if possible.
> There are already a couple of input formats doing a similar thing: MultifileInputFormat
> as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

