pig-dev mailing list archives

From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2237) LIMIT generates wrong number of records if pig determines no of reducers as more than 1
Date Wed, 24 Aug 2011 23:01:29 GMT

    [ https://issues.apache.org/jira/browse/PIG-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090601#comment-13090601 ]

Daniel Dai commented on PIG-2237:
---------------------------------

This happens because SampleOptimizer changes the parallelism of "order by" according to
the input size. By that time, LimitAdjuster has already decided whether or not to add one
additional limit job. We need to run LimitAdjuster after SampleOptimizer.
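
The ordering problem described above can be illustrated with a small toy model. This is
only a sketch, not Pig's actual code: the functions `order_by_then_limit` and `plan`, the
pass names, and the starting parallelism of 1 are all hypothetical stand-ins chosen for
illustration.

```python
def order_by_then_limit(values, reducers, limit, extra_limit_job):
    """Toy model: range-partition sorted input across reducers, limit per reducer."""
    ordered = sorted(values)
    # Split the sorted data into contiguous ranges, one per reducer.
    size = max(1, -(-len(ordered) // reducers))  # ceiling division, at least 1
    partitions = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    # Each reducer applies the limit to its own partition only.
    out = [v for part in partitions for v in part[:limit]]
    # An extra single-reducer job re-applies the limit to the combined output.
    return out[:limit] if extra_limit_job else out


def plan(passes, estimated_parallel):
    """Run hypothetical optimizer passes in order; return (parallel, extra_limit_job)."""
    # Assumption: with no explicit PARALLEL, the plan starts out looking like 1 reducer.
    state = {"parallel": 1, "extra_limit_job": False}
    for name in passes:
        if name == "limit_adjuster":
            # Adds the extra job only if it already sees more than one reducer.
            state["extra_limit_job"] = state["parallel"] > 1
        elif name == "sample_optimizer":
            # Changes the parallelism based on the input-size estimate.
            state["parallel"] = estimated_parallel
    return state["parallel"], state["extra_limit_job"]


values = list(range(12, 0, -1))
buggy = plan(["limit_adjuster", "sample_optimizer"], estimated_parallel=4)
fixed = plan(["sample_optimizer", "limit_adjuster"], estimated_parallel=4)
print(len(order_by_then_limit(values, buggy[0], 2, buggy[1])))  # 8 records
print(len(order_by_then_limit(values, fixed[0], 2, fixed[1])))  # 2 records
```

In this model, running the limit check first leaves 4 reducers each emitting 2 records,
while running it after the parallelism change adds the final single-reducer limit.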

> LIMIT generates wrong number of records if pig determines no of reducers as more than 1
> ---------------------------------------------------------------------------------------
>
>                 Key: PIG-2237
>                 URL: https://issues.apache.org/jira/browse/PIG-2237
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.9.0
>            Reporter: Anitha Raju
>            Assignee: Daniel Dai
>             Fix For: 0.9.1, 0.10
>
>
> Hi,
> For a script
> ========
> A = load 'test.txt' using PigStorage() as (a:int,b:int);
> B = order A by a ;
> C = limit B 2;
> store C into 'op1' using PigStorage();
> ========
> LIMIT and ORDER BY are done in the same MR job if no explicit PARALLEL is specified.
> In this case, the number of reducers is determined by Pig, and sometimes it is calculated
> as greater than 1.
> Since the limit happens on the reduce side, each reduce task applies the limit separately,
> generating up to n*2 records, where n is the number of reduce tasks calculated by Pig.
> If the number of reduce tasks is specified explicitly with the PARALLEL keyword on
> ORDER BY,
> ==========
> B = order A by a PARALLEL 4;
> ==========
> another MR job is created with 1 reduce task, where the limit is done.
> In short, the issue occurs when the number of reducers calculated by Pig is greater than
> 1 and a limit is involved in the MR job.
> The issue can be replicated by specifying
> ==========
> -Dpig.exec.reducers.bytes.per.reducer
> ==========
> The issue is seen in versions 0.8 and 0.9. It works correctly in 0.7.
> Regards,
> Anitha

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
