pig-dev mailing list archives

From "Yan Zhou (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1518) multi file input format for loaders
Date Tue, 17 Aug 2010 21:44:19 GMT

    [ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899605#action_12899605 ]

Yan Zhou commented on PIG-1518:

One experimental result on a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G RAM boxes is as follows:


register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, (double)estimated_revenue;
B1 = distinct B;
alpha = load '/user/pig/tests/data/pigmix/users' using PigStorage('\u0001') as (name, phone,
        city, state, zip);
beta = foreach alpha generate name;
C = join beta by name, B1 by user parallel 300;
D = group C by $0 parallel 40;
E = foreach D generate group, SUM(C.estimated_revenue);
store E into 'spliCombo2.out';

It creates 3 map/reduce jobs.

No Split Combination:

|elapsed time|24s|2m43s|
|elapsed time|46s|3m11s|
|elapsed time|38s|53s|
|Total elapsed time|7m36s|

With Split Combination:

|elapsed time|22s|2m49s|
|elapsed time|27s|2m46s|
|elapsed time|17s|24s|
|Total elapsed time|7m5s|

> multi file input format for loaders
> -----------------------------------
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
> We frequently run into the situation where Pig needs to deal with small files in the input.
> In this case a separate map is created for each file, which could be very inefficient.
> It would be great to have an umbrella input format that can take multiple files and
> use them in a single split. We would like to see this working with different data formats
> if possible.
> There are already a couple of input formats doing a similar thing: MultifileInputFormat
> as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
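
For illustration only, the idea in the description above (several small files sharing one split, so that one map handles them together) can be sketched as a greedy packing of file sizes. This is not Pig's or Hadoop's actual implementation; the function name combine_splits, the max_split_size parameter, and the sample sizes are all hypothetical.

```python
from typing import List


def combine_splits(file_sizes: List[int], max_split_size: int) -> List[List[int]]:
    """Greedily pack file sizes into combined splits (hypothetical helper).

    Files are taken in order. A new split is started whenever adding the
    next file would push the current split past max_split_size. A file
    larger than the limit ends up in a split of its own.
    """
    splits: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in file_sizes:
        # Close the current split if this file would not fit in it.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


# Made-up file sizes in MB, with an assumed 128 MB limit per combined split.
sizes_mb = [5, 10, 20, 64, 3, 3, 150, 8]
combined = combine_splits(sizes_mb, max_split_size=128)
print(len(sizes_mb), "files ->", len(combined), "splits")
```

Without combination each of the eight files would get its own map; under this sketch the six leading small files share one split, so the number of maps drops to three.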

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
