pig-dev mailing list archives

From "Richard Ding (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1724) Multiquery optimization miscalculates the parallelism and results in extra 0 bytes files (Pig 0.7 and 0.8)
Date Tue, 16 Nov 2010 00:51:16 GMT

    [ https://issues.apache.org/jira/browse/PIG-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932290#action_12932290
] 

Richard Ding commented on PIG-1724:
-----------------------------------

It is true that some cluster resources are wasted in this case, but you gain a performance
improvement from multi-query optimization. You can disable multi-query optimization to work
around this issue. We can look into it in a future release.
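
A minimal sketch of the workaround mentioned above, assuming the script from the report is saved under the hypothetical name pig1724_repro.pig. The -no_multiquery flag (short form -M) is Pig's command-line switch for turning multi-query execution off; whether it removes the extra part files for this script has not been verified here.

```shell
#!/bin/sh
# Workaround sketch: run the script with multi-query optimization turned off.
# SCRIPT is a hypothetical file name holding the Pig Latin from the report.
SCRIPT="pig1724_repro.pig"

# -no_multiquery (short form: -M) disables multi-query execution, so the
# STORE statements are no longer merged into a shared job.
PIG_CMD="pig -no_multiquery $SCRIPT"

if command -v pig >/dev/null 2>&1; then
    # Pig is installed: run the script without the optimization.
    $PIG_CMD
else
    # No Pig client on this machine: only show what would be run.
    echo "pig not found on PATH; would run: $PIG_CMD"
fi
```

The trade-off is the one described above: the shared input is likely to be read once per STORE instead of once overall.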

> Multiquery optimization miscalculates the parallelism and results in extra 0 bytes files
(Pig 0.7 and 0.8)
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1724
>                 URL: https://issues.apache.org/jira/browse/PIG-1724
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.7.0, 0.8.0
>
>         Attachments: samplepig001.in
>
>
> We have found an issue with Pig 0.8 and Pig 0.7 when using multi-query optimization: it
produces more part files than required. Note that the GROUP ALL is a dummy
in this case.
> {code}
> record002 = LOAD 'samplepig001.in' AS (id:chararray,num:int);
> f_records002= FILTER record002 BY num!=50000;
> group01 = GROUP f_records002 ALL PARALLEL 1;
> STORE group01 INTO 'pig_out_direc_SET1';
> set2 = FILTER f_records002 BY num!=200002;
> set2_Group = GROUP set2 ALL PARALLEL 1;
> STORE set2 INTO 'pig_out_direc_SET2';
> set3 = FILTER f_records002 BY num!=100001;
> set3_Group= GROUP set3 BY id PARALLEL 40;
> --set3_Rec4= FILTER set3_Group by num!=5000000;
> STORE set3_Group INTO 'pig_out_direc_SET3';
> {code}
> When run in Pig 0.8 it results in the following output.
> {quote}
> $ hadoop fs -ls /user/viraj/pig_out_direc_SET1
> ...
> Found 40 items
> -rw-------   3 viraj users          0 2010-11-13 02:09 /user/viraj/pig_out_direc_SET1/part-r-00000
> ...
> ...
> -rw-------   3 viraj users          0 2010-11-13 02:09 /user/viraj/pig_out_direc_SET1/part-r-00039
> $ hadoop fs -ls /user/viraj/pig_out_direc_SET2
> Found 1 items
> -rw-------   3 viraj users        110 2010-11-13 02:08 /user/viraj/pig_out_direc_SET2/part-m-00000
> $ hadoop fs -ls /user/viraj/pig_out_direc_SET3
> Found 40 items
> -rw-------   3 viraj users          0 2010-11-13 02:09 /user/viraj/pig_out_direc_SET3/part-r-00000
> ...
> ...
> -rw-------   3 viraj users          0 2010-11-13 02:09 /user/viraj/pig_out_direc_SET3/part-r-00039
> {quote}
> Viraj
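
To check an output directory for the symptom in the listings above, the following sketch counts zero-byte part files. It assumes the 8-column `hadoop fs -ls` layout shown in the report (permissions, replication, owner, group, size, date, time, path); the helper name count_empty_parts is hypothetical.

```shell
#!/bin/sh
# Count zero-byte part files in `hadoop fs -ls` output read from stdin.
# Assumes the 8-column listing layout from the report:
# perms, replication, owner, group, size, date, time, path.
count_empty_parts() {
    # Skip the "Found N items" header (fewer than 8 fields), keep rows whose
    # size column is 0 and whose path ends in part-m-NNNNN or part-r-NNNNN.
    awk 'NF >= 8 && $5 == 0 && $NF ~ /\/part-[mr]-[0-9]+$/ { n++ } END { print n + 0 }'
}

# Usage (requires a Hadoop client; the path is the one from the report):
#   hadoop fs -ls /user/viraj/pig_out_direc_SET1 | count_empty_parts
```

A non-zero count for a directory written by a GROUP ... PARALLEL 1 statement would indicate the extra reducers described in the issue.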

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

