hadoop-pig-dev mailing list archives

From "Shravan Matthur Narayanamurthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-364) Limit return incorrect records when we use multiple reducer
Date Thu, 18 Sep 2008 20:10:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632365#action_12632365 ]

Shravan Matthur Narayanamurthy commented on PIG-364:

Consider the following script:
a = load 'file:/etc/passwd';
b = limit a 10;
c = filter b by 2>1 parallel 10;
split c into c1 if 2>1, c2 if 2>1;
d = group c1 by $0;
e = group c2 by $0;
f = group d by $0, e by $0;
dump f;

If I understand the code correctly, this is a case where multiple MROps are generated at the
split, as shown in the figure below.


Now when the job controller sees this graph of MROps, it first schedules the LD MROp. To remind
you, the limitadjuster has by this point redirected the output of this MROp to a temporary file.
Once LD completes, the controller can schedule either the Lim Adj Op or the 2-LRs Op, whose
dependency has just been resolved. If it chooses the 2-LRs Op, that Op tries to read the
original output of the split, which doesn't exist yet because the Lim Adj Op hasn't run, and
the job fails. If it chooses the Lim Adj Op first, things go fine.

To avoid this, we need to disconnect all of LD's successors from LD, make the Lim Adj Op their
predecessor, and connect the Lim Adj Op to LD, as indicated in the figure.
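
The rewiring described above can be sketched as follows. This is only a minimal Python illustration, not Pig's actual plan or job-controller code: the dict-based graph, the op names "LD", "LimAdj" and "2-LRs", and the helpers `ready_ops` and `insert_limit_adjuster` are all hypothetical, introduced here just to show why the extra edge removes the race.

```python
def ready_ops(preds, done):
    """Return the not-yet-run ops whose predecessors have all completed."""
    return sorted(op for op, ps in preds.items()
                  if op not in done and ps <= done)


def insert_limit_adjuster(preds, source, adjuster):
    """Return a new graph in which every successor of `source` depends on
    `adjuster` instead, and `adjuster` itself depends on `source`."""
    rewired = {}
    for op, ps in preds.items():
        if op == adjuster:
            continue  # the adjuster's own edge is set explicitly below
        rewired[op] = {adjuster if p == source else p for p in ps}
    rewired[adjuster] = {source}
    return rewired


# Hypothetical plan before the fix: both ops depend only on LD,
# so a scheduler may pick either one once LD has finished.
buggy = {"LD": set(), "LimAdj": {"LD"}, "2-LRs": {"LD"}}
print(ready_ops(buggy, done={"LD"}))            # ['2-LRs', 'LimAdj']

# After rewiring, only the adjuster is runnable once LD has finished.
fixed = insert_limit_adjuster(buggy, "LD", "LimAdj")
print(ready_ops(fixed, done={"LD"}))            # ['LimAdj']
print(ready_ops(fixed, done={"LD", "LimAdj"}))  # ['2-LRs']
```

With the original edges, the set of runnable ops after LD contains both successors, so correctness depends on which one the scheduler happens to pick. With the rewired edges, the ordering is enforced by the graph itself.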

Let me know if my understanding is wrong.

> Limit return incorrect records when we use multiple reducer
> -----------------------------------------------------------
>                 Key: PIG-364
>                 URL: https://issues.apache.org/jira/browse/PIG-364
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>             Fix For: types_branch
>         Attachments: limitsplit.png, PIG-364-2.patch, PIG-364.patch
> Currently we put the Limit(k) operator in the reducer plan. However, in the case of n reducers,
> we will get up to n*k output records.
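
The n*k behaviour in the issue description can be illustrated with a small Python sketch. It is not Pig code: the list-of-lists model of reducer partitions and the functions `limit_per_reducer` and `limit_with_final_pass` are hypothetical, and modelling the correction as one extra pass over the combined output is an assumption made for illustration only.

```python
def limit_per_reducer(partitions, k):
    """Apply limit k independently inside each reducer's partition."""
    return [row for part in partitions for row in part[:k]]


def limit_with_final_pass(partitions, k):
    """Apply the per-reducer limit, then one more limit over the
    combined output (assumed here to run in a single task)."""
    return limit_per_reducer(partitions, k)[:k]


# Three hypothetical reducers, each holding 50 records.
partitions = [list(range(i * 100, i * 100 + 50)) for i in range(3)]

print(len(limit_per_reducer(partitions, 10)))      # 30, i.e. n*k
print(len(limit_with_final_pass(partitions, 10)))  # 10, i.e. k
```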

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
