pig-dev mailing list archives

From "Richard Ding (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1724) Multiquery optimization miscalculates the parallelism and results in extra 0 bytes files (Pig 0.7 and 0.8)
Date Mon, 15 Nov 2010 20:58:15 GMT

    [ https://issues.apache.org/jira/browse/PIG-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932191#action_12932191 ]

Richard Ding commented on PIG-1724:
-----------------------------------


This is by design. To quote the multi-query functional spec:

{code}
What is the parallelism (the number of reduce tasks requested) of the merged splitter job?

How do we partition the keys of the merged inner plans?

After considering several partition schemes, we settled on this one:

    * The parallelism of the merged splitter job is the maximum of the parallelisms of all splittee jobs.
    * The keys from inner plans are partitioned into all the buckets via the default hash partitioner.
{code}
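
A minimal sketch of how the two rules quoted above could lead to empty part files. This is a simplified Python model, not Pig's implementation: the function names are made up for illustration, Python's hash() stands in for Java's hashCode(), and the single string "all" is an assumed stand-in for the one group key that GROUP ... ALL produces.

```python
# Sketch only: models the two quoted rules, not Pig's actual code.

def merged_parallelism(splittee_parallelisms):
    """Rule 1: the merged splitter job requests the maximum parallelism
    among the splittee jobs."""
    return max(splittee_parallelisms)


def bucket_for(key, num_reducers):
    """Rule 2: hash-partition a key, in the style of Hadoop's default
    HashPartitioner: (hashCode & Integer.MAX_VALUE) % numReduceTasks.
    Python's hash() is an assumption standing in for hashCode(), so the
    bucket numbers will not match a real cluster."""
    return (hash(key) & 0x7FFFFFFF) % num_reducers


def non_empty_buckets(keys, num_reducers):
    """Set of reducer buckets that receive at least one key."""
    return {bucket_for(k, num_reducers) for k in keys}


# PARALLEL values of the GROUP statements whose results are stored
# in the script from the issue description (group01 and set3_Group).
parallelism = merged_parallelism([1, 40])

# GROUP ... ALL yields a single group key; "all" is a hypothetical stand-in.
group_all_buckets = non_empty_buckets(["all"], parallelism)
empty = parallelism - len(group_all_buckets)

print(parallelism, len(group_all_buckets), empty)  # prints: 40 1 39
```

Under these assumptions the merged job runs 40 reducers, the single GROUP ALL key lands in exactly one of them, and the remaining 39 reducers have nothing to write for that store, which is consistent with the zero-byte part-r files listed in the issue.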

> Multiquery optimization miscalculates the parallelism and results in extra 0 bytes files (Pig 0.7 and 0.8)
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1724
>                 URL: https://issues.apache.org/jira/browse/PIG-1724
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.7.0, 0.8.0
>
>         Attachments: samplepig001.in
>
>
> We have found an issue with Pig 0.8 and Pig 0.7 when using multiquery optimization: it produces more part files than required. Note that the GROUP ALL is a dummy in this case.
> {code}
> record002 = LOAD 'samplepig001.in' AS (id:chararray,num:int);
> f_records002= FILTER record002 BY num!=50000;
> group01 = GROUP f_records002 ALL PARALLEL 1;
> STORE group01 INTO 'pig_out_direc_SET1';
> set2 = FILTER f_records002 BY num!=200002;
> set2_Group = GROUP set2 ALL PARALLEL 1;
> STORE set2 INTO 'pig_out_direc_SET2';
> set3 = FILTER f_records002 BY num!=100001;
> set3_Group= GROUP set3 BY id PARALLEL 40;
> --set3_Rec4= FILTER set3_Group by num!=5000000;
> STORE set3_Group INTO 'pig_out_direc_SET3';
> {code}
> When run in Pig 0.8 it results in the following output.
> {quote}
> $ hadoop fs -ls /user/viraj/pig_out_direc_SET1
> ...
> Found 40 items
> -rw-------   3 viraj users          0 2010-11-13 02:09 /user/viraj/pig_out_direc_SET1/part-r-00000
> ...
> ...
> -rw-------   3 viraj users          0 2010-11-13 02:09 /user/viraj/pig_out_direc_SET1/part-r-00039
> $ hadoop fs -ls /user/viraj/pig_out_direc_SET2
> Found 1 items
> -rw-------   3 viraj users        110 2010-11-13 02:08 /user/viraj/pig_out_direc_SET2/part-m-00000
> $ hadoop fs -ls /user/viraj/pig_out_direc_SET3
> Found 40 items
> -rw-------   3 viraj users          0 2010-11-13 02:09 /user/viraj/pig_out_direc_SET3/part-r-00000
> ...
> ...
> -rw-------   3 viraj users          0 2010-11-13 02:09 /user/viraj/pig_out_direc_SET3/part-r-00039
> {quote}
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

