hadoop-pig-dev mailing list archives

From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
Date Fri, 11 Dec 2009 02:08:18 GMT

     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1144:
----------------------------

    Attachment: PIG-1144-2.patch

I think the reason is that the quantile job needs to know how many reducers we are going to use
in order to decide which tuples to write into quantilesFile. The number of reducers is a constant
field of the plan, so we cannot set it to -1 and let Hadoop decide the parallelism later. The fix
takes default_parallel as that constant when the user does not use the PARALLEL keyword. It
applies to both order by and skew join. Merge join and FRJoin are map-only, and regular join was
already taken care of in the original code. I am attaching the patch again; nothing has changed
except for the addition of a new test case for skew join.
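
To illustrate the reasoning above, here is a minimal sketch in plain Java. It is not taken from
the patch or from the Pig code base: the class name, both method names, and the fallback to a
single reducer when neither value is set are assumptions made for illustration. It shows why
the reducer count has to be a concrete number before sampling: the number of cut points
written for the partitioner depends directly on it.

```java
import java.util.Arrays;

// Hypothetical sketch; none of these names exist in Pig.
final class ParallelismSketch {

    // Marker for "no PARALLEL clause given", mirroring the -1 mentioned in the message.
    static final int UNSET = -1;

    // Pick the reducer count that must be fixed in the plan before the sampling job runs.
    static int resolveParallelism(int requestedParallel, int defaultParallel) {
        if (requestedParallel > 0) {
            return requestedParallel;   // an explicit PARALLEL clause wins
        }
        if (defaultParallel > 0) {
            return defaultParallel;     // otherwise use "set default_parallel"
        }
        return 1;                       // assumed fallback: a single reducer
    }

    // Choose numReducers - 1 cut points from a sample of keys so that each
    // reducer receives a roughly equal share of the sorted key range.
    static int[] quantileBoundaries(int[] sample, int numReducers) {
        if (numReducers < 1) {
            throw new IllegalArgumentException("numReducers must be at least 1");
        }
        if (sample.length == 0) {
            return new int[0];          // nothing sampled, so no cut points
        }
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] cuts = new int[numReducers - 1];
        for (int i = 1; i < numReducers; i++) {
            int index = (int) ((long) i * sorted.length / numReducers);
            cuts[i - 1] = sorted[index];
        }
        return cuts;
    }

    public static void main(String[] args) {
        int reducers = resolveParallelism(UNSET, 4);
        int[] cuts = quantileBoundaries(new int[] {5, 1, 9, 3, 7, 2, 8, 4}, reducers);
        System.out.println(reducers + " reducers, cut points " + Arrays.toString(cuts));
    }
}
```

With an unresolved value such as -1 there is no meaningful number of cut points to write, which
is consistent with the point above that the parallelism cannot be left for Hadoop to decide later.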

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100".
> I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr)
> mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " Parallelism " + mr.getRequestedParallelism());
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort,
> runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the
> ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

