pig-dev mailing list archives

From "Jie Li (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-2779) Refactoring the code for setting number of reducers
Date Mon, 16 Jul 2012 22:42:35 GMT

     [ https://issues.apache.org/jira/browse/PIG-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Li updated PIG-2779:

    Attachment: PIG-2779.1.patch

The latest PIG-2779.1.patch introduces the notion of runtimeParallelism, which is set to the
first positive value among parallel, default_parallel and the estimated parallelism, checked in that order.
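
As a rough illustration of the selection rule above, here is a minimal Java sketch. The class name, method name and parameter names are hypothetical and are not taken from the patch; returning -1 when no candidate is positive is also an assumption, since the message does not say what happens in that case.

```java
// Hypothetical sketch of the "first positive value" rule; not code from the patch.
final class RuntimeParallelismSketch {

    // Returns the first positive value among the three candidates, in order:
    // the PARALLEL clause, default_parallel, then the estimate.
    // Returning -1 when none is positive is an assumption made for this sketch.
    static int firstPositive(int requestedParallel, int defaultParallel, int estimatedParallel) {
        if (requestedParallel > 0) {
            return requestedParallel;
        }
        if (defaultParallel > 0) {
            return defaultParallel;
        }
        if (estimatedParallel > 0) {
            return estimatedParallel;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstPositive(2, 4, 8));   // prints 2
        System.out.println(firstPositive(-1, 4, 8));  // prints 4
        System.out.println(firstPositive(-1, -1, 8)); // prints 8
    }
}
```

Under this rule an explicit parallel always wins over the estimate, so a user-specified parallel 1 would be honored rather than replaced by the estimated value.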

For sampler jobs, we used to set #partitions at compile-time and reset it at runtime; this
patch will remove the compile-time setting and only keep the runtime setting. 

For the runtime setting of #partitions, we used to estimate based on the sampler's input;
this patch will instead estimate based on the next job's input, as for skew-join they are

For the job that follows the sampler, e.g. order-by and skew join, we used to calculate its #reducers
independently from the sampler; this patch will instead calculate it together with the sampler,
so we can keep the sampler's #partitions and the next job's #reducers synchronized.

> Refactoring the code for setting number of reducers
> ---------------------------------------------------
>                 Key: PIG-2779
>                 URL: https://issues.apache.org/jira/browse/PIG-2779
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Jie Li
>            Assignee: Jie Li
>             Fix For: 0.11
>         Attachments: PIG-2779.0.patch, PIG-2779.1.patch, TestNumberOfReducers.java, TestNumberOfReducers.java
> As PIG-2652 observed, the code for setting the number of reducers is currently a little messy.
> MapReduceOper.requestedParallelism seems to be misused in some places, and we now support
> runtime estimation of #reducers, which further complicates the problem.
> For example, if we specify parallel 1 for the order-by, the estimated #reducers will be
> used. If we specify parallel 2 while it estimates 4, order-by will fail due to "Illegal partition
> for Null". If we specify parallel 4 while it estimates 2, then some reducers will have nothing
> to do.
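
The two mismatches described in the issue can be modelled with a toy range partitioner. This is only a sketch under stated assumptions: the class, the methods and the integer keys are hypothetical, and it stands in for neither Pig's partitioner nor Hadoop's own partition check. The sampler is modelled as producing N-1 boundary keys for N partitions, and the job is modelled as accepting only partition indexes in the range 0 to #reducers-1.

```java
import java.util.Arrays;

// Hypothetical toy model of the sampler/reducer mismatch; not Pig or Hadoop code.
final class PartitionMismatchSketch {

    // Toy range partitioner: sorted boundaries split the key space into
    // boundaries.length + 1 partitions. A key equal to a boundary goes to
    // the partition on the boundary's left.
    static int partitionFor(int key, int[] boundaries) {
        int pos = Arrays.binarySearch(boundaries, key);
        return pos >= 0 ? pos : -(pos + 1);
    }

    // Models the job-side check: a partition index is only legal if some
    // reducer exists to receive it.
    static boolean isLegal(int partition, int numReducers) {
        return partition >= 0 && partition < numReducers;
    }

    public static void main(String[] args) {
        // Sampler estimated 4 partitions, but the job runs with parallel 2.
        int[] fourWay = {10, 20, 30};
        int p = partitionFor(35, fourWay);
        System.out.println(p + " legal with 2 reducers? " + isLegal(p, 2));

        // Sampler estimated 2 partitions, but the job runs with parallel 4:
        // only partitions 0 and 1 are ever produced, so reducers 2 and 3 stay idle.
        int[] twoWay = {20};
        int q = partitionFor(35, twoWay);
        System.out.println(q + " legal with 4 reducers? " + isLegal(q, 4));
    }
}
```

In the first case the partitioner emits an index that no reducer can accept, which corresponds to the "Illegal partition" failure; in the second case every index is legal but the upper reducers never receive data. Keeping the sampler's #partitions and the next job's #reducers equal avoids both.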
