hadoop-pig-dev mailing list archives

From "Shravan Matthur Narayanamurthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-364) Limit return incorrect records when we use multiple reducer
Date Fri, 12 Sep 2008 20:29:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630673#action_12630673
] 

Shravan Matthur Narayanamurthy commented on PIG-364:
----------------------------------------------------

The patch seems to follow the approach mentioned, but I think there is a problem with the
approach itself. I tested it both with and without the patch; the problem exists in both cases.
Consider the following script:

A = load 'st10M';
B = limit A 10;
C = filter B by 2>1 parallel 10;
dump C;

I would expect to see only 10 tuples here, but I see 40. The reason is that there were
4 mappers, producing 40 tuples in all, and the filter asked for 10 reducers; since the limit
crosses the map-reduce boundary, the limit on the reduce side did no limiting, as there is
capacity to pass 10*10 = 100 tuples.
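As a minimal sketch (in Python, not Pig internals), this is why a per-task limit of k combined with n parallel tasks can emit up to n*k tuples overall; the partition sizes are made up for illustration:

```python
# Hypothetical illustration, not Pig code: LIMIT applied independently
# inside each parallel task cannot enforce a global limit.

def limited(tuples, k):
    """Apply LIMIT k within a single map or reduce task."""
    return tuples[:k]

# 4 map tasks, each holding 25 input tuples, each applying LIMIT 10 locally.
partitions = [list(range(p * 25, (p + 1) * 25)) for p in range(4)]
map_output = [t for part in partitions for t in limited(part, 10)]
print(len(map_output))  # 40 tuples = 4 mappers * 10, not the expected 10

# 10 reduce tasks, each again applying LIMIT 10, have combined capacity
# for 10 * 10 = 100 tuples, so all 40 pass through unreduced.
```

This matches the script above: only a final single-reducer pass (or an equivalent global cut) can guarantee exactly 10 output tuples.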

The problem is that we cannot guarantee the parallelism by setting the requested parallelism
to some value while visiting limit, as it can get modified further down the line, which is shown
in the example. The same thing happens even when the limiter is in the reducer, as in the following
script, where I see 97 tuples instead of the expected 10. If everything went right (that is,
no tuples were clipped) this would have been 100, and the actual output is pretty close:

A = load 'st10M';
B1 = group A by $0 parallel 10;
B = limit B1 10;
C = filter B by 2>1 parallel 10;
dump C;

One way I can think of is to terminate the new reduce phase created with a store, thereby
guaranteeing that the parallelism stays at the value set during limit, and then start a new
MapReduceOper by loading this file. Another way is to disable further changes to the reduce
parallelism by maintaining a flag that controls changes to the parallelism of the map-reduce
operator. But this would mean we disobey the user's request, which might be meaningless in
some places. Also, the semantics of parallel will be distorted when limit is in the picture.
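The second idea can be sketched as follows; this is a hypothetical illustration in Python, and the names (MROper, set_parallelism, lock_parallelism) are made up for the example, not Pig's actual API:

```python
# Sketch of a flag that freezes the requested parallelism once LIMIT
# has set it, so later PARALLEL clauses cannot override it.

class MROper:
    def __init__(self):
        self.requested_parallelism = -1  # -1: not yet requested
        self._locked = False

    def set_parallelism(self, n):
        if self._locked:
            return  # silently ignore later PARALLEL hints (distorts semantics)
        self.requested_parallelism = n

    def lock_parallelism(self, n):
        self.requested_parallelism = n
        self._locked = True

op = MROper()
op.lock_parallelism(1)   # visiting limit forces a single reducer
op.set_parallelism(10)   # the filter's "parallel 10" is now ignored
print(op.requested_parallelism)  # 1
```

The cost, as noted above, is exactly the drawback of this approach: the user's PARALLEL request is silently dropped whenever a limit is in the plan.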

> Limit return incorrect records when we use multiple reducer
> -----------------------------------------------------------
>
>                 Key: PIG-364
>                 URL: https://issues.apache.org/jira/browse/PIG-364
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Daniel Dai
>            Assignee: Shravan Matthur Narayanamurthy
>             Fix For: types_branch
>
>         Attachments: PIG-364.patch
>
>
> Currently we put Limit(k) operator in the reducer plan. However, in the case of n reducer,
we will get up to n*k output. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

